Looking for Faster Alternatives to String.split in Java

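String.split is one of the most commonly used string operations in Java. In ordinary business code it is perfectly fine, but if you need high-performance splitting, it takes a little effort to find something faster.
The regex parameter of String.split is not a plain string but a regular expression, so the delimiter can be any pattern. That looks like a nice feature, but from another angle it is a performance killer.
In the Java 6 implementation, every call to String.split creates a new Pattern by compiling the argument, and only then splits the string. Compiling a regular expression is comparatively expensive, and the implementation does not cache the Pattern, so performance is poor when split is called frequently. If you really do need to split by a regular expression, cache the Pattern yourself.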
public String[] split(String regex, int limit) {
    return Pattern.compile(regex).split(this, limit);
}
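The advice to cache the Pattern can be sketched as follows. This is a minimal illustration, not code from the JDK: the class name CachedPatternSplit, the field COMMA_WITH_SPACES, and the sample delimiter are all made up for the example.

```java
import java.util.regex.Pattern;

final class CachedPatternSplit {
    // Compiled once and reused for every call. Pattern instances are
    // immutable and safe to share between threads.
    private static final Pattern COMMA_WITH_SPACES = Pattern.compile("\\s*,\\s*");

    // Splits on a comma surrounded by optional whitespace without
    // recompiling the regular expression on each call.
    static String[] splitFields(String line) {
        return COMMA_WITH_SPACES.split(line);
    }

    public static void main(String[] args) {
        for (String field : splitFields("a , b,c")) {
            System.out.println(field);
        }
    }
}
```

Pattern.split(CharSequence) follows the same rules as String.split(String), so the result is the same as calling split on the string directly; only the repeated compilation goes away.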
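Most of the time, though, we do not actually want to split by a regular expression. All we want is to split on a simple character such as a space or an underscore, and paying the cost of regular expression support just for that is a poor trade.
For this reason the Java 7 implementation adds an optimization for single-character delimiters. In that case it bypasses the regular expression machinery and uses indexOf to locate each delimiter directly, which is much faster.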
/* fastpath if the regex is a
 (1)one-char String and this character is not one of the
    RegEx's meta characters ".$|()[{^?*+\\", or
 (2)two-char String and the first char is the backslash and
    the second is not the ascii digit or ascii letter.
 */
char ch = 0;
if (((regex.value.length == 1 &&
     ".$|()[{^?*+\\".indexOf(ch = regex.charAt(0)) == -1) ||
     (regex.length() == 2 &&
      regex.charAt(0) == '\\' &&
      (((ch = regex.charAt(1))-'0')|('9'-ch)) < 0 &&
      ((ch-'a')|('z'-ch)) < 0 &&
      ((ch-'A')|('Z'-ch)) < 0)) &&
    (ch < Character.MIN_HIGH_SURROGATE ||
     ch > Character.MAX_LOW_SURROGATE))
{
    int off = 0;
    int next = 0;
    boolean limited = limit > 0;
    ArrayList<String> list = new ArrayList<>();
    while ((next = indexOf(ch, off)) != -1) {
        if (!limited || list.size() < limit - 1) {
            list.add(substring(off, next));
            off = next + 1;
        } else {    // last one
            //assert (list.size() == limit - 1);
            list.add(substring(off, value.length));
            off = value.length;
            break;
        }
    }
    // If no match was found, return this
    if (off == 0)
        return new String[]{this};

    // Add remaining segment
    if (!limited || list.size() < limit)
        list.add(substring(off, value.length));

    // Construct result
    int resultSize = list.size();
    if (limit == 0)
        while (resultSize > 0 && list.get(resultSize - 1).length() == 0)
            resultSize--;
    String[] result = new String[resultSize];
    return list.subList(0, resultSize).toArray(result);
}
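The two conditions in the comment above decide which delimiters qualify for the fast path. The sketch below cannot show which path is taken, since the result is the same either way. It only shows why a metacharacter such as "." has to be written as the two-character escaped form "\\." to be treated as a literal delimiter, and that escaped form is exactly what condition (2) covers. The class and method names are invented for the example.

```java
final class SplitDelimiterDemo {
    // Returns how many parts String.split produces for the given regex.
    static int countParts(String input, String regex) {
        return input.split(regex).length;
    }

    public static void main(String[] args) {
        // "_" is a one-char, non-meta delimiter: condition (1).
        System.out.println(countParts("a_b_c", "_"));
        // "\\." is backslash plus a non-alphanumeric char: condition (2).
        System.out.println(countParts("a.b.c", "\\."));
        // "." alone is a regex metacharacter that matches every character,
        // so all parts are empty and the trailing empties are dropped.
        System.out.println(countParts("a.b.c", "."));
    }
}
```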
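Is there anything faster? If the delimiter is longer than one character and you still do not need regular expression matching, split falls back to the regular expression path just as in Java 6. There are a few other options:

Use StringTokenizer. It has no regular expression support and simply returns the tokens one at a time. By default it splits on whitespace (space, tab, newline, carriage return and form feed), and it performs slightly better than String.split. However, it is an old implementation, a legacy class in the JDK, and its own Javadoc discourages its use in new code.
Use org.apache.commons.lang3.StringUtils.split, which provides a better implementation for cases that do not need regular expressions. Note that its String separator argument is treated as a set of separator characters; to split on a multi-character string as a whole, use StringUtils.splitByWholeSeparator.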
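To make the difference between "a set of delimiter characters" and "one literal delimiter string" concrete, here is a small sketch using only the JDK. The class LiteralSplit and both methods are hypothetical helpers written for this illustration; splitByLiteral is not the Commons Lang implementation, just a plain indexOf loop under the assumption that empty tokens should be kept.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

final class LiteralSplit {
    // StringTokenizer treats the delimiter string as a set of characters
    // and never returns empty tokens.
    static List<String> tokenize(String str, String delimiterChars) {
        List<String> out = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(str, delimiterChars);
        while (tokenizer.hasMoreTokens()) {
            out.add(tokenizer.nextToken());
        }
        return out;
    }

    // Splits on the whole separator string, without any regex, keeping
    // empty tokens between adjacent separators.
    static List<String> splitByLiteral(String str, String separator) {
        if (separator.isEmpty()) {
            throw new IllegalArgumentException("separator must not be empty");
        }
        List<String> out = new ArrayList<>();
        int start = 0;
        int next;
        while ((next = str.indexOf(separator, start)) != -1) {
            out.add(str.substring(start, next));
            start = next + separator.length();
        }
        out.add(str.substring(start)); // remaining segment after the last separator
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("a,,b", ","));
        System.out.println(splitByLiteral("a::b::::c", "::"));
    }
}
```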

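Can it get any faster? Notice that both String.split and StringUtils.split return String[]. A plain array has a fixed size, yet the number of tokens cannot be known before splitting, so something must be going on behind that array. Let's look at how it is built.
In the single-character path of String.split, the tokens are actually collected in an ArrayList. There is no clever trick: only on the last line of that path is the ArrayList converted to an array, and toArray copies the elements into the new array.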
return list.subList(0, resultSize).toArray(result);
StringUtils.split does the same thing.
return list.toArray(new String[list.size()]);
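So here is one possible optimization: copy the implementation and change the method's return type to List, which avoids the memory cost of the array copy.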
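A minimal sketch of that idea, modeled on the single-character path shown earlier with limit fixed at 0. The class name ListSplit and its split method are invented for the example; this is not the JDK code itself.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class ListSplit {
    // Splits on a single literal character and returns the tokens as a List,
    // skipping the final toArray copy. Trailing empty tokens are dropped,
    // mirroring String.split with limit == 0.
    static List<String> split(String str, char sep) {
        List<String> list = new ArrayList<>();
        int off = 0;
        int next;
        while ((next = str.indexOf(sep, off)) != -1) {
            list.add(str.substring(off, next));
            off = next + 1;
        }
        // No delimiter found: the whole string is the only token.
        if (off == 0) {
            return Collections.singletonList(str);
        }
        list.add(str.substring(off)); // remaining segment
        int size = list.size();
        while (size > 0 && list.get(size - 1).isEmpty()) {
            size--;
        }
        return list.subList(0, size); // a view, no copy
    }

    public static void main(String[] args) {
        System.out.println(split("a,b,,", ','));
    }
}
```

The returned subList is a view over the ArrayList, so callers that need an independent list would have to copy it themselves.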
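Can it get any faster? Very often we only want to iterate over the tokens and do something with each one; we do not really need the array. It is the same idea as reading a file: there is no need to load the whole file into memory first when it can be processed one line at a time. So a further optimization is to add a callback parameter that handles each token, and to call it at the points where the token used to be added to the list. This avoids creating the array entirely. In other words, treat string splitting as a stream.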
// Requires: import java.util.function.Consumer;
// Adapted from StringUtils.splitWorker: tokens go to the callback instead of a list.
private static void splitWorker(final String str, final String separatorChars, final int max, final boolean preserveAllTokens, Consumer<String> onSplit) {
    if (str == null) {
        return;
    }
    final int len = str.length();
    if (len == 0) {
        return;
    }
    int sizePlus1 = 1;
    int i = 0, start = 0;
    boolean match = false;
    boolean lastMatch = false;
    if (separatorChars == null) {
        // Null separator means use whitespace
        while (i < len) {
            if (Character.isWhitespace(str.charAt(i))) {
                if (match || preserveAllTokens) {
                    lastMatch = true;
                    if (sizePlus1++ == max) {
                        i = len;
                        lastMatch = false;
                    }
                    onSplit.accept(str.substring(start, i));
                    match = false;
                }
                start = ++i;
                continue;
            }
            lastMatch = false;
            match = true;
            i++;
        }
    } else if (separatorChars.length() == 1) {
        // Optimise 1 character case
        final char sep = separatorChars.charAt(0);
        while (i < len) {
            if (str.charAt(i) == sep) {
                if (match || preserveAllTokens) {
                    lastMatch = true;
                    if (sizePlus1++ == max) {
                        i = len;
                        lastMatch = false;
                    }
                    onSplit.accept(str.substring(start, i));
                    match = false;
                }
                start = ++i;
                continue;
            }
            lastMatch = false;
            match = true;
            i++;
        }
    } else {
        // standard case
        while (i < len) {
            if (separatorChars.indexOf(str.charAt(i)) >= 0) {
                if (match || preserveAllTokens) {
                    lastMatch = true;
                    if (sizePlus1++ == max) {
                        i = len;
                        lastMatch = false;
                    }
                    onSplit.accept(str.substring(start, i));
                    match = false;
                }
                start = ++i;
                continue;
            }
            lastMatch = false;
            match = true;
            i++;
        }
    }
    if (match || preserveAllTokens && lastMatch) {
        onSplit.accept(str.substring(start, i));
    }
}

public static void split(final String str, final String separatorChars, Consumer<String> onSplit) {
    splitWorker(str, separatorChars, -1, false, onSplit);
}

// Usage
public void example() {
    split("Hello world", " ", System.out::println);
}
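Can it get any faster? There is a more extreme optimization. Taking a substring actually copies the characters (this has been the case since Java 7u6), so the callback can instead receive the start and end offsets of each token within the original string and read the characters from that range itself. This is not very general, so the code is not shown here; try it if you are interested.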
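For readers who want a starting point, the range-based callback could look roughly like this. Everything here is an assumption made for illustration: the RangeConsumer interface, the RangeSplit class, and the choice to skip empty tokens (as the callback version does with preserveAllTokens set to false). The sumFields helper only handles non-negative decimal fields and exists to show a consumer that never creates a substring.

```java
final class RangeSplit {
    // Hypothetical callback: receives the source string and the token's
    // half-open range [start, end) instead of a substring.
    @FunctionalInterface
    interface RangeConsumer {
        void accept(String source, int start, int end);
    }

    // Reports each non-empty token as a range; no substring is created.
    static void split(String str, char sep, RangeConsumer onRange) {
        final int len = str.length();
        int start = 0;
        for (int i = 0; i < len; i++) {
            if (str.charAt(i) == sep) {
                if (i > start) {
                    onRange.accept(str, start, i);
                }
                start = i + 1;
            }
        }
        if (start < len) {
            onRange.accept(str, start, len);
        }
    }

    // Example consumer: sums non-negative decimal fields by reading the
    // digits straight from the source string.
    static long sumFields(String csv) {
        final long[] total = {0};
        split(csv, ',', (source, from, to) -> {
            long value = 0;
            for (int i = from; i < to; i++) {
                value = value * 10 + (source.charAt(i) - '0');
            }
            total[0] += value;
        });
        return total[0];
    }

    public static void main(String[] args) {
        System.out.println(sumFields("10,20,,30"));
    }
}
```

The trade-off is the one the text mentions: every consumer now has to work with offsets, so this only pays off where the token can be processed in place.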