zip压缩算法源码分析(1)
LZ77 压缩算法
这⾥的图都是⾃⼰做的,所以很抱歉,可能在⽂章页⾯看起来不怎么清楚,想仔细看细节的话,可以
在新页⾯打开图⽚,会清晰很多的。
I. 主要思路
LZ77算法压缩的是输⼊中重复出现的那些短语,每读到⼀个新字符,就向前扫描,看是否有重复出现的短语。若搜索出来有重复出现的短语,就⽤形如
(charactor(匹配串的起始字符), distance, length)的三元组来代替该匹配串,如图
为了保证压缩速度,LZ77 限制了向前搜索的范围,所有查重匹配⼯作都在⼀个“滑窗”内进⾏。
gzip在LZ77算法的基础上增加了不少的优化技巧。该过程主要在c⽂件deflate.c中(这整个算法就叫做deflate算法,deflate意为“放⽓”, 有种把车胎放⽓弄瘪的意思在⾥⾯,不知道这个理解对不对(⊙v⊙))。
II. deflate.c源码分析
源码的分析我打算分段来写,并且为了逻辑清晰起见,顺序与源码不同。
(i.)先贴上代码⾸部的注释:(好⽍我也是全英班的,就当练练英语吧。)
/*
* PURPOSE ⽬的
*
* Identify new text as repetitions of old text within a fixed-
* length sliding window trailing behind the new text.
* 这⾥的主要活动是在⼀个定长滑动窗⼝内进⾏字符串匹配并初次压缩输出的⼯作。
*
* DISCUSSION
*
* The "deflation" process depends on being able to identify portions
* of the input text which are identical to earlier input (within a
* sliding window trailing behind the input currently being processed).
* “deflation”算法依靠前后⽂中相同的字符串达到压缩的⽬的。
*
* The most straightforward technique turns out to be the fastest for
* most input files: try all possible matches and select the longest.
* 查所有匹配的串并且选出匹配长度最长的哪⼀个串进⾏压缩---
* 这种最直接粗暴的压缩技术,被证明在绝⼤多数输⼊情形下都是最快的。
* The key feature of this algorithm is that insertions into the string
* dictionary are very simple and thus fast, and deletions are avoided
* completely.
* 该算法的主要特性是对字典区块(就是滑动窗⼝中已经处理过的那部分
* 字符组成的字典,由哈希表实现)插⼊(新词条)的操作简洁且快速,
* 同时还完全地避免了删除(词条)的操作。
* Insertions are performed at each input character, whereas
* string matches are performed only when the previous match ends. So it
* is preferable to spend more time in matches to allow very fast string
* insertions and avoid deletions.
* insertions and avoid deletions.
* 每读到⼀个新字符都会进⾏⼀次(字典区块的)插⼊操作,⽽匹配操作仅当
* 之前的⼀次操作完成时才会开始进⾏。所以总是希望可以通过适当增加匹配操作
* 时间来减少插⼊操作的时间。
* The matching algorithm for small
* strings is inspired from that of Rabin & Karp. A brute force approach
* is used to find longer strings when a small match has been found.
* A similar algorithm is used in comic (by Jan-Mark Wams) and freeze
* (by Leonid Broukhis).
* A previous version of this file used a more sophisticated algorithm
* (by Fiala and Greene) which is guaranteed to run in linear amortized
* time, but has a larger average cost, uses more memory and is patented.
* However the F&G algorithm may be faster for some highly redundant
* files if the parameter max_chain_length (described below) is too large.
* (此处为介绍各种技巧思路的来源以及同之前版本的对⽐,⽆太⼤意义,故不译。) *
* ACKNOWLEDGEMENTS
*
* The idea of lazy evaluation of matches is due to Jan-Mark Wams, and
* I found it in 'freeze' written by Leonid Broukhis.
* Thanks to many info-zippers for bug reports and testing.
*
* REFERENCES
*
* APPNOTE.TXT documentation file in PKZIP 1.93a distribution.
*
* A description of the Rabin and Karp algorithm is given in the book
* "Algorithms" by R. Sedgewick, Addison-Wesley, p252.
*
* Fiala,E.R., and Greene,D.H.
* Data Compression with Finite Windows, Comm.ACM, 32,4 (1989) 490-595
* (此处亦为介绍各种技巧思路的来源以及同之前版本的对⽐,⽆太⼤意义,故不译。) *
* INTERFACE
* 2个接⼝函数
* void lm_init (int pack_level, ush *flags)
* Initialize the "longest match" routines for a new file
* 初始化函数,启动LZ77压缩进程;
* ulg deflate (void)
* Processes a new input file and return its compressed length. Sets
诗的起源* the compressed length, crc, deflate flags and internal file
* attributes.
* LZ77压缩主进程都在该函数⾥。
*/
苹果系统*(ii.)接下来开始分析函数 void lm_init (int pack_level, ush flags):
这⾥对哈希表的初始化可能有些令⼈疑惑,实际上,此处的哈希表并⾮由链接法或开域法实现,
⽽是由2个数组构成的,这样做的原因是节约空间,避免node的空间申请与删除。
(看不清楚请点右键在新选项卡中打开)
如图,哈希表由2个数组head[WSIZE]与prev[WSIZE]构成,WSIZE = ox8000 = 32k,
这两个数组构成了许多条链,所有由⼀个特定哈希函数取到相同哈希码的字符串们组成⼀条链。
head[]装载所有链头串的地址,prev[]则记录串与串之间的链接关系。链的具体实现⽅式稍后会解释。顺便⼀提,这种链的构造⽅式与boost库的内存池很相似。
其次,在查匹配串的时候使⽤了lazy match技术,就是说如果当前这对匹配串不够长,就把
strtsart指针继续右移⼀个字符,看看接下来到的匹配串是不是更好。
/* ===========================================================================
* Initialize the "longest match" routines for a new file
*/
void lm_init (pack_level, flags)
int pack_level; /* 0: store, 1: best speed, 9: best compression */
ush *flags; /* general purpose bit flag */
{
register unsigned j;
if (pack_level < 1 || pack_level > 9) error("bad pack level");
compr_level = pack_level;
/* Initialize the hash table.
* 将head数组初始化
*/
#if defined(MAXSEG_64K) && HASH_BITS == 15
for (j = 0; j < HASH_SIZE; j++) head[j] = NIL;
#else
memzero((char*)head, HASH_SIZE*sizeof(*head));
#endif
/
* prev will be initialized on the fly
* prev将会在后⾯动态地进⾏初始化
*/
//4种配置参数,见函数下⽅注1.
max_lazy_match = configuration_table[pack_level].max_lazy;
good_match = configuration_table[pack_level].good_length;福建公办大专学校
#ifndef FULL_SEARCH
nice_match = configuration_table[pack_level].nice_length;
#endif
max_chain_length = configuration_table[pack_level].max_chain;
max_chain_length = configuration_table[pack_level].max_chain;
if (pack_level == 1) {
*flags |= FAST;
} else if (pack_level == 9) {
*flags |= SLOW;
}
/* reduce max_chain_length for binary files */
//初始化strstart与block_start指针
strstart = 0;
block_start = 0L;
#ifdef ASMV
match_init(); /* initialize the asm code */
康途网#endif
//初始化超前查看区,read_buf函数详情见下⽅注2.
lookahead = read_buf((char*)window,
sizeof(int) <= 2 ? (unsigned)WSIZE : 2*WSIZE);
//如果没有⾜够空间,lookahead会得到标记值eof(end of file)
if (lookahead == 0 || lookahead == (unsigned)EOF) {
eofile = 1, lookahead = 0;
return;
}
//有⾜够空间
eofile = 0;
/排骨炖藕
* Make sure that we always have enough lookahead. This is important
* if input comes from a device such as a tty.
* 对于像从tty终端这样的设备的输⼊来说,保证超前查看区总有⾜够的空间是很重要的。 * 函数fill_window()主要负责向滑窗写⼊信息,是该⽂件中较长较重要的⼀个函数,,
* 后⾯会单独拿出来分析,就不放在注释的部分了。
* 常量 MIN_LOOKAHEAD 见注释3.
*/
while (lookahead < MIN_LOOKAHEAD && !eofile) fill_window();
//初始化哈希码,使⽤了宏UPDATE_HASH,该哈希算法⼗分有趣,对于理解
//整个算法也很重要,见注4。
ins_h = 0;
for (j=0; j<MIN_MATCH-1; j++) UPDATE_HASH(ins_h, window[j]);
/* If lookahead < MIN_MATCH, ins_h is garbage, but this is
* not important since only literal bytes will be emitted.
*/
}
@注1.
控制运⾏速度与效果的4个全局配置参数:
1. max_lazy_match
定义 local unsigned int max_lazy_match
Attempt to find a better match only when the current match is strictly
smaller than this value. This mechanism is used only for compression
levels >= 4.
⽤于判断是否进⾏lazy match,⼩于该长度则不满意,进⾏lazy match;⼤于该长度则认为还算可以,略过。该机制只在压缩程度>=4时⽣效。
2. good_match
unsigned near good_match
Use a faster search when the previous match is longer than this
当匹配串⽐该长度长时,通过调整系数(具体实现在longest_match中有描述)加快该次
搜索的速度。
3. nice_match
取值由是否定义宏FULL_SEARCH决定
#ifdef FULL_SEARCH
# define nice_match MAX_MATCH
(MAX_MATCH = 258, MIN_MATCH = 3,见gzip.h)
#else
int near nice_match;
#endif
Stop searching when current match exceeds this
当匹配长超过该值时,认为够好了,停⽌搜索。
4. max_chain_length
unsigned near max_chain_length;
To speed up deflation, hash chains are never searched beyond this length.
A higher limit improves compression ratio but degrades the speed.
⽤于加快deflation算法速度,在查时,遍历的串数⽬永不超过该值。
另外还有⼀个configuration_table数组,其下标对应不同的压缩等级:
local config configuration_table[10] = {
good lazy nice chain
0 {0, 0, 0, 0}, store only
1 {4, 4, 8, 4}, maximum speed, no lazy matches
2 {4, 5, 16, 8},
3 {4, 6, 32, 32},
4 {4, 4, 16, 16}, lazy matches
5 {8, 16, 32, 32},
6 {8, 16, 128, 128},
7 {8, 32, 128, 256},
古代通信方式8 {32, 128, 258, 1024},
9 {32, 258, 258, 4096}}; maximum compression
@注2.
事实上,函数read_buf真⾝位于zip.c⽂件。
版权声明:本站内容均来自互联网,仅供演示用,请勿用于商业和其他非法用途。如果侵犯了您的权益请与我们联系QQ:729038198,我们将在24小时内删除。
发表评论