`
hax
  • 浏览: 951599 次
  • 性别: Icon_minigender_1
  • 来自: 上海
社区版块
存档分类
最新评论

写对正则:一行代码,速度差50倍

    博客分类:
  • JS
阅读更多

2009-05-11

A lesson of RegExp: 50x faster with just one line patch

While I'm developing WebSHi (which is the fastest syntax highlighter written by JavaScript), I also write many performance testings for other rivals. One of them is SyCODE Syntax Highlighter , which is written by silverdrag (水月). It derives from the famous SyntaxHighlighter 1.5.x (dp.SH for short) and as silverdrag's words, it should be 5x to 10x faster than original dp.SH .

But unfortunately, my testings can't prove it. Though it won't trigger the "script slowly" dialog like dp.SH when highlighting large file, in most cases, it only shows 2x faster than dp.SH on IE6. On the other side, when I tested it on FF, I was so surprised that SyCODE is extremely slow, it will cost 5s+ for processing a 700 lines JavaScript file while the original dp.SH only half second.

It's very strange, so I digged into SyCODE. I found that SyCODE highlights more words for JavaScript language (currently all my testcases are to highlight some JavaScript source code files). The original dp.SH (and most other rivals) only highlights the keywords of JavaScript language. SyCODE also highlights global names like Array , Boolean , String , etc., and properties and methods like alert , charAt , onclick etc.. That means SyCODE need to do more text searching and replacement. I disabled such features and tested again, this time SyCODE is 2x faster than dp.SH.

So you will think the problem is just the extra words replacement. And what interesting is it just affect FF a lot, even SyCODE do more text processing, it's still faster than (or at least as fast as) dp.SH on other browsers (Safari, Chrome, Opera and IE).

I'm curious about the root cause of the problem. After some researching, I located it. Just one simple function:


GetKeywords: function(str) {
 return '\\b' + str.replace(/\s+/g, '\\b|\\b') + '\\b';
},

The function GetKeywords is used to generate a regexp for keywords search and replacement. For example, GetKeywords("abstract break byte case catch") will return a regexp /\babstract\b|\bbreak\b|\bbyte\b|\bcase\b|\bcatch\b/ .

The code is straightforward, but it's bad and generate a very inefficient regexp.

The keypoint is \b , \b is a word boundary assertion. To test whether a position is a word boundary, the regexp engine need to consider both the left character of the position and the right character of the position. If one is a word character (aka a-z, A-Z, 0-9 and the underscore "_") and the other is not, then it's a word boundary. You see it need both look forward one char and look backward one char. Though \b assertion is not very expensive, each failed match of /\babstract\b|\bbreak\b|\bbyte\b|\bcase\b|\bcatch\b/ will do such look forward/backward 10 times, and JavaScript language has 50+ keywords means each failed match will do 100 times, and SyCODE add 400+ properties/methods words means extra 800+ times!

Of coz, \b assertion can be easily optimized, but our test result shows that FF's regexp engine doesn't do a good optimization at all.

Anyway, there is a very cheap way to solve the problem. Most of those \b assertions are unnecessary . /\babstract\b|\bbreak\b|\bbyte\b|\bcase\b|\bcatch\b/ can be rewrite as /\b(?:abstract|break|byte|case|catch)\b/ , those two regexp are equal, the only difference is the latter only need two \b assertion. Yes, we just need two , even SyCODE add 400+ words, we still just need two .

It's trivial to fix GetKeywords :


GetKeywords: function(str) {
 return '\\b' + str.replace(/\s+/g, '\\b|\\b') + '\\b';

 return '\\b(' + str.replace(/\s+/g, '|') + ')\\b';

},

Let's see the result of applying this one line patch:

Test results of FF3 code lines original patched
700 6.7s 0.1s
1600 15.5s 0.3s
4300 41.5s 0.7s

Oops, one line code cause 50x difference.

Besides FF, the patch also help other browsers a lot.

Test results of 4300 lines of code browser original patched
IE6 6.3s 2.8s
Safari3 2.6s 0.8s
Opera9 8.2s 1.7s
Chrome1 2.3s 0.5s

As we can see, even Chrome, which introduce a very optimized regexp engine, also shows at least 4x difference.

SyCODE derives from dp.SH, the GetKeywords function is also the legacy from dp.SH, and even the new SyntaxHighlighter 2 still use the similar code. Because dp.SH only highlight about 50 keywords for JavaScript langauge, you will not see performance issue like SyCODE, but applying this one line patch still introduce 20% faster on most browsers.

But this patch is not the end. In next article, I will discuss a complex technique to get another 10% to 40% faster for keywords search/replacement.

7
5
分享到:
评论
5 楼 sohighthesky 2010-07-25  
     学习
4 楼 terryang 2009-05-14  
yidao620c 写道



3 楼 yidao620c 2009-05-14  
2 楼 i_love_sc 2009-05-13  
帖子是英文的,没什么。反正能看懂。还用英文回帖就太……
1 楼 boin 2009-05-12  
aweson post!
greatly inspires me on digging more deeper into regexp.
can't wait to see the following posts.

相关推荐

    精通正则表达式~~~

    精通正则表达式第三版 搜集于网络 前言..........I 第1章:正则表达式入门.... 1 解决实际问题... 2 作为编程语言的正则表达式... 4 以文件名做类比... 4 以语言做类比... 5 正则表达式的知识框架... 6 对于...

    正则表达式

    这行代码创建一个新的RegExp对象,并将它赋给变量parttern.这个特殊的RegExp对象和所有以字母"s"结尾的字符串都匹配.用RegExp()也可以定义 一个等价的正则表达式,代码如下: var pattern = new RegExp("s$"); ...

    PHP实现正则表达式分组捕获操作示例

    本文实例讲述了PHP实现正则表达式分组捕获操作。分享给大家供大家参考,具体如下: ...写完了,得再检查一遍,这个遇到困难了,二三千行的代码,用眼睛一行一行查,那的比较累了,而且还不一定能检查

    精易模块[源码] V5.15

    4、改善“字节集_还原”优化子程序处理速度,感谢易友【落款hMZ】提供代码。 5、修改“窗口_热键注册”API正确申明,并把第五个参数【热键标识文本】改为热键ID。 6、新增“类_任务栏”可以显示隐藏任何第三方窗口...

    Java-PHP-C#

    此外,JavaScript这种客户端的脚本语言也提供了对正则表达式的支持,现在正则表达式已经成为了一个通用的概念和工具,被各类技术人员所广泛使用。 在某个Linux网站上面有这样的话:"如果你问一下Linux爱好者最喜欢...

    淘客必备,超级批量修改工具,批量修改pid代码等

    可自定义所支持的文件类型支持16进制替换,支持单行和多行以及段落替换支持特征替换和提取,支持正则替换支持多规则同时替换并可以行导入规则智能规则排序功能,支持多级目录大小写匹配、支持备份和恢复,替换速度快...

    精通qt4编程(源代码)

    \初级篇 第1章 Qt初步实践 卢传富 建立了第一个较简单的Qt应用程序,在GUI用户界面中显示一行中文。 2 \ 第2章 对话框 \——QDialog 卢传富介绍了Qt的对话框类QDialog,实现了一个自定义的登录对话框,举例说明了Qt...

    超实用的jQuery代码段

    11.13 向表格追加一行数据 11.14 获取客户端IP 11.15 向Firebug的控制面板发送消息 11.16 根据不同的屏幕大小显示不同的网页 11.17 jQuery遍历对象的属性 11.18 最优化的循环语句写法 11.19 如何构建最优化的字符串 ...

    精通Qt4编程(第二版)源代码

    \初级篇 第1章 Qt初步实践 卢传富 建立了第一个较简单的Qt应用程序,在GUI用户界面中显示一行中文。 2 \ 第2章 对话框 \——QDialog 卢传富介绍了Qt的对话框类QDialog,实现了一个自定义的登录对话框,举例说明了...

    Java项目源码之文本编辑器的实现.rar

    异步加载:对大型文本文件进行异步加载,避免阻塞界面线程,提高编辑器的响应速度。 缓存策略:采用合理的文本缓存策略,减少IO操作次数,提高文本编辑效率。 文本编辑器的实现旨在为用户提供一个功能丰富、易于使用...

    精易官方免费模块v3.60版

    2.重写“数组_排序”,速度提升256倍以上! 感谢会员 落雪 提供的参考代码 3.增加 DLL "lstrcmp" 目前应用于数组_排序 4.增加“数组_反转”,感谢会员 落雪 提供的参考代码 精易模块 V3.51 what’s new:...

    TXT杀手最终版本——很好用

    为了方便大家使用,特别针对【正则表达式】做如下整理说明,(在本程序中使用的话依次选择“按行分割”、“正则表达式”) 一、按照章查找分割 第[一二两三四五六七八九十○零百0-91234567890]{1,12}章 二、...

    详解Vue This$Store总结

    1 正则表达式,正则表达式大家都不陌生,正则表达式的优点就是速度快,对于一个代码补全的插件,用户肯定希望更快的得到反馈,测试了 1000 行的 vue 文件匹配全部的 mapXXX() 之类的 api 字符串也只是用了 1ms 都不...

    你必须知道的495个C语言问题

    中,如果不关心a[]的哪一个分量会被写入,这段代码就没有问题,i也的确会增加1,对吗? 3.11 人们总是说i=i++的行为是未定义的。可我刚刚在一个ANSI编译器上尝试过,其结果正如我所期望的。 3.12 我不想学习那些...

    入门学习Linux常用必会60个命令实例详解doc/txt

    上面代码中,第一行是Linux发行版本号,第二行是内核版本号和登录的虚拟控制台,我们在第三行输入登录名,按“Enter”键在Password后输入账户密码,即可登录系统。出于安全考虑,输入账户密码时字符不会在屏幕上回显...

    CURL用法大全

    最有可能的是您试图进入一个在此服务器上不存在的目录。 11:FTP 非正常的PASS回复。cURL无法解析发送到PASS请求的应答。 13:FTP 非正常的的PASV应答,cURL无法解析发送到PASV请求的应答。 14:FTP非正常的227格式。...

    《你必须知道的495个C语言问题》

    中,如果不关心a[]的哪一个分量会被写入,这段代码就没有问题,i也的确会增加1,对吗? 38  3.11 人们总是说i=i++的行为是未定义的。可我刚刚在一个ANSI编译器上尝试过,其结果正如我所期望的。 38  3.12 我不...

Global site tag (gtag.js) - Google Analytics