init
This commit is contained in:
244
expkg/vendor/lz4/doc/lz4_Block_format.md
vendored
Normal file
244
expkg/vendor/lz4/doc/lz4_Block_format.md
vendored
Normal file
@ -0,0 +1,244 @@
|
||||
LZ4 Block Format Description
|
||||
============================
|
||||
Last revised: 2022-07-31 .
|
||||
Author : Yann Collet
|
||||
|
||||
|
||||
This specification is intended for developers willing to
|
||||
produce or read LZ4 compressed data blocks
|
||||
using any programming language of their choice.
|
||||
|
||||
LZ4 is an LZ77-type compressor with a fixed byte-oriented encoding format.
|
||||
There is no entropy encoder back-end nor framing layer.
|
||||
The latter is assumed to be handled by other parts of the system
|
||||
(see [LZ4 Frame format]).
|
||||
This design is assumed to favor simplicity and speed.
|
||||
|
||||
This document describes only the Block Format,
|
||||
not how the compressor nor decompressor actually work.
|
||||
For more details on such topics, see later section "Implementation Notes".
|
||||
|
||||
[LZ4 Frame format]: lz4_Frame_format.md
|
||||
|
||||
|
||||
|
||||
Compressed block format
|
||||
-----------------------
|
||||
An LZ4 compressed block is composed of sequences.
|
||||
A sequence is a suite of literals (not-compressed bytes),
|
||||
followed by a match copy operation.
|
||||
|
||||
Each sequence starts with a `token`.
|
||||
The `token` is a one byte value, separated into two 4-bits fields.
|
||||
Therefore each field ranges from 0 to 15.
|
||||
|
||||
|
||||
The first field uses the 4 high-bits of the token.
|
||||
It provides the length of literals to follow.
|
||||
|
||||
If the field value is smaller than 15,
|
||||
then it represents the total nb of literals present in the sequence,
|
||||
including 0, in which case there is no literal.
|
||||
|
||||
The value 15 is a special case: more bytes are required to indicate the full length.
|
||||
Each additional byte then represents a value from 0 to 255,
|
||||
which is added to the previous value to produce a total length.
|
||||
When the byte value is 255, another byte must be read and added, and so on.
|
||||
There can be any number of bytes of value `255` following `token`.
|
||||
The Block Format does not define any "size limit",
|
||||
though real implementations may feature some practical limits
|
||||
(see more details in later chapter "Implementation Notes").
|
||||
|
||||
Note : this format explains why a non-compressible input block is expanded by 0.4%.
|
||||
|
||||
Example 1 : A literal length of 48 will be represented as :
|
||||
|
||||
- 15 : value for the 4-bits High field
|
||||
- 33 : (=48-15) remaining length to reach 48
|
||||
|
||||
Example 2 : A literal length of 280 will be represented as :
|
||||
|
||||
- 15 : value for the 4-bits High field
|
||||
- 255 : following byte is maxed, since 280-15 >= 255
|
||||
- 10 : (=280 - 15 - 255) remaining length to reach 280
|
||||
|
||||
Example 3 : A literal length of 15 will be represented as :
|
||||
|
||||
- 15 : value for the 4-bits High field
|
||||
- 0 : (=15-15) yes, the zero must be output
|
||||
|
||||
Following `token` and optional length bytes, are the literals themselves.
|
||||
They are exactly as numerous as just decoded (length of literals).
|
||||
Reminder: it's possible that there are zero literals.
|
||||
|
||||
|
||||
Following the literals is the match copy operation.
|
||||
|
||||
It starts by the `offset` value.
|
||||
This is a 2 bytes value, in little endian format
|
||||
(the 1st byte is the "low" byte, the 2nd one is the "high" byte).
|
||||
|
||||
The `offset` represents the position of the match to be copied from the past.
|
||||
For example, 1 means "current position - 1 byte".
|
||||
The maximum `offset` value is 65535. 65536 and beyond cannot be coded.
|
||||
Note that 0 is an invalid `offset` value.
|
||||
The presence of a 0 `offset` value denotes an invalid (corrupted) block.
|
||||
|
||||
Then the `matchlength` can be extracted.
|
||||
For this, we use the second `token` field, the low 4-bits.
|
||||
Such a value, obviously, ranges from 0 to 15.
|
||||
However here, 0 means that the copy operation is minimal.
|
||||
The minimum length of a match, called `minmatch`, is 4.
|
||||
As a consequence, a 0 value means 4 bytes.
|
||||
Similarly to literal length, any value smaller than 15 represents a length,
|
||||
to which 4 (`minmatch`) must be added, thus ranging from 4 to 18.
|
||||
A value of 15 is special, meaning 19+ bytes,
|
||||
to which one must read additional bytes, one at a time,
|
||||
with each byte value ranging from 0 to 255.
|
||||
They are added to total to provide the final match length.
|
||||
A 255 value means there is another byte to read and add.
|
||||
There is no limit to the number of optional `255` bytes that can be present,
|
||||
and therefore no limit to representable match length,
|
||||
though real-life implementations are likely going to enforce limits for practical reasons (see more details in "Implementation Notes" section below).
|
||||
|
||||
Note: this format has a maximum achievable compression ratio of about ~250.
|
||||
|
||||
Decoding the `matchlength` reaches the end of current sequence.
|
||||
Next byte will be the start of another sequence, and therefore a new `token`.
|
||||
|
||||
|
||||
End of block conditions
|
||||
-------------------------
|
||||
There are specific restrictions required to terminate an LZ4 block.
|
||||
|
||||
1. The last sequence contains only literals.
|
||||
The block ends right after the literals (no `offset` field).
|
||||
2. The last 5 bytes of input are always literals.
|
||||
Therefore, the last sequence contains at least 5 bytes.
|
||||
- Special : if input is smaller than 5 bytes,
|
||||
there is only one sequence, it contains the whole input as literals.
|
||||
Even empty input can be represented, using a zero byte,
|
||||
interpreted as a final token without literal and without a match.
|
||||
3. The last match must start at least 12 bytes before the end of block.
|
||||
The last match is part of the _penultimate_ sequence.
|
||||
It is followed by the last sequence, which contains _only_ literals.
|
||||
- Note that, as a consequence,
|
||||
blocks < 12 bytes cannot be compressed.
|
||||
And as an extension, _independent_ blocks < 13 bytes cannot be compressed,
|
||||
because they must start by at least one literal,
|
||||
that the match can then copy afterwards.
|
||||
|
||||
When a block does not respect these end conditions,
|
||||
a conformant decoder is allowed to reject the block as incorrect.
|
||||
|
||||
These rules are in place to ensure compatibility with
|
||||
a wide range of historical decoders
|
||||
which rely on these conditions for their speed-oriented design.
|
||||
|
||||
Implementation notes
|
||||
-----------------------
|
||||
The LZ4 Block Format only defines the compressed format,
|
||||
it does not tell how to create a decoder or an encoder,
|
||||
which design is left free to the imagination of the implementer.
|
||||
|
||||
However, thanks to experience, there are a number of typical topics that
|
||||
most implementations will have to consider.
|
||||
This section tries to provide a few guidelines.
|
||||
|
||||
#### Metadata
|
||||
|
||||
An LZ4-compressed Block requires additional metadata for proper decoding.
|
||||
Typically, a decoder will require the compressed block's size,
|
||||
and an upper bound of decompressed size.
|
||||
Other variants exist, such as knowing the decompressed size,
|
||||
and having an upper bound of the input size.
|
||||
The Block Format does not specify how to transmit such information,
|
||||
which is considered an out-of-band information channel.
|
||||
That's because in many cases, the information is present in the environment.
|
||||
For example, databases must store the size of their compressed block for indexing,
|
||||
and know that their decompressed block can't be larger than a certain threshold.
|
||||
|
||||
If you need a format which is "self-contained",
|
||||
and also transports the necessary metadata for proper decoding on any platform,
|
||||
consider employing the [LZ4 Frame format] instead.
|
||||
|
||||
#### Large lengths
|
||||
|
||||
While the Block Format does not define any maximum value for length fields,
|
||||
in practice, most implementations will feature some form of limit,
|
||||
since it's expected for such values to be stored into registers of fixed bit width.
|
||||
|
||||
If length fields use 64-bit registers,
|
||||
then it can be assumed that there is no practical limit,
|
||||
as it would require a single continuous block of multiple petabytes to reach it,
|
||||
which is unreasonable by today's standard.
|
||||
|
||||
If length fields use 32-bit registers, then it can be overflowed,
|
||||
but requires a compressed block of size > 16 MB.
|
||||
Therefore, implementations that do not deal with compressed blocks > 16 MB are safe.
|
||||
However, if such a case is allowed,
|
||||
then it's recommended to check that no large length overflows the register.
|
||||
|
||||
If length fields use 16-bit registers,
|
||||
then it's definitely possible to overflow such register,
|
||||
with less than < 300 bytes of compressed data.
|
||||
|
||||
A conformant decoder should be able to detect length overflows when it's possible,
|
||||
and simply error out when that happens.
|
||||
The input block might not be invalid,
|
||||
it's just not decodable by the local decoder implementation.
|
||||
|
||||
Note that, in order to be compatible with the larger LZ4 ecosystem,
|
||||
it's recommended to be able to read and represent lengths of up to 4 MB,
|
||||
and to accept blocks of size up to 4 MB.
|
||||
Such limits are compatible with 32-bit length registers,
|
||||
and prevent overflow of 32-bit registers.
|
||||
|
||||
#### Safe decoding
|
||||
|
||||
If a decoder receives compressed data from any external source,
|
||||
it is recommended to ensure that the decoder is resilient to corrupted input,
|
||||
and made safe from buffer overflow manipulations.
|
||||
Always ensure that read and write operations
|
||||
remain within the limits of provided buffers.
|
||||
|
||||
Of particular importance, ensure that the nb of bytes instructed to copy
|
||||
does not overflow neither the input nor the output buffers.
|
||||
Ensure also, when reading an offset value, that the resulting position to copy
|
||||
does not reach beyond the beginning of the buffer.
|
||||
Such a situation can happen during the first 64 KB of decoded data.
|
||||
|
||||
For more safety, test the decoder with fuzzers
|
||||
to ensure it's resilient to improbable sequences of conditions.
|
||||
Combine them with sanitizers, in order to catch overflows (asan)
|
||||
or initialization issues (msan).
|
||||
|
||||
Pay some attention to offset 0 scenario, which is invalid,
|
||||
and therefore must not be blindly decoded:
|
||||
a naive implementation could preserve destination buffer content,
|
||||
which could then result in information disclosure
|
||||
if such buffer was uninitialized and still containing private data.
|
||||
For reference, in such a scenario, the reference LZ4 decoder
|
||||
clears the match segment with `0` bytes,
|
||||
though other solutions are certainly possible.
|
||||
|
||||
Finally, pay attention to the "overlap match" scenario,
|
||||
when `matchlength` is larger than `offset`.
|
||||
In which case, since `match_pos + matchlength > current_pos`,
|
||||
some of the later bytes to copy do not exist yet,
|
||||
and will be generated during the early stage of match copy operation.
|
||||
Such scenario must be handled with special care.
|
||||
A common case is an offset of 1,
|
||||
meaning the last byte is repeated `matchlength` times.
|
||||
|
||||
#### Compression techniques
|
||||
|
||||
The core of a LZ4 compressor is to detect duplicated data across past 64 KB.
|
||||
The format makes no assumption nor limits to the way a compressor
|
||||
searches and selects matches within the source data block.
|
||||
For example, an upper compression limit can be reached,
|
||||
using a technique called "full optimal parsing", at high cpu and memory cost.
|
||||
But multiple other techniques can be considered,
|
||||
featuring distinct time / performance trade-offs.
|
||||
As long as the specified format is respected,
|
||||
the result will be compatible with and decodable by any compliant decoder.
|
||||
Reference in New Issue
Block a user