BVGraph Format
Compressed graph format from the WebGraph framework
Overview
BVGraph is a highly compressed graph representation developed by Paolo Boldi and Sebastiano Vigna and first described in their WWW 2004 paper, “The WebGraph Framework I: Compression Techniques.” It is the native storage format of the WebGraph framework and is used extensively by the LAW (Laboratory for Web Algorithmics) dataset collection at the University of Milan.
BVGraph achieves very high compression ratios on web graphs and social networks—often representing each link in fewer than 3 bits—by exploiting two key properties of these graphs:
- Locality: successor lists of nearby nodes tend to be similar, so a node can reference a nearby node’s list and encode only the differences.
- Consecutivity: successor lists often contain long contiguous runs of IDs, which compress well as (start, length) intervals.
On top of these structural techniques, BVGraph encodes all integers using variable-length codes (gamma, delta, and zeta codes) that assign shorter representations to smaller values.
File Components
Each BVGraph dataset consists of three files sharing a common basename:
| File | Description |
|---|---|
<name>.graph | Compressed bitstream containing all successor lists, packed at the bit level. This is the main data file. |
<name>.properties | Java properties file with graph metadata: node/arc counts and all compression parameters needed to decode the .graph file. |
<name>.offsets | Bit-level offsets for each node’s entry in the .graph file. Stores the delta-coded bit position of each node, enabling random access and parallel decompression. |
The .offsets file can be regenerated from the .graph and .properties files:
# C++ (no Java required)
./bvgraph_gen_offsets data/eu-2005
# Java (using the WebGraph library)
java -cp jlibs/*: it.unimi.dsi.webgraph.BVGraph data/eu-2005 -O
Properties File
The .properties file is a plain-text Java properties file that stores all metadata and compression parameters required to decode the .graph bitstream.
Example
#BVGraph
#Thu Jan 01 00:00:00 CET 2024
graphclass=it.unimi.dsi.webgraph.BVGraph
version=0
nodes=862664
arcs=19235140
windowsize=7
maxrefcount=3
minintervallength=4
zetak=3
compressionflags=
Key Definitions
| Key | Type | Description |
|---|---|---|
nodes | integer | Total number of vertices in the graph. |
arcs | integer | Total number of directed edges (arcs). |
windowsize | integer | Size of the reference window. A node can reference any of the previous windowsize nodes. Typical value: 7. |
maxrefcount | integer | Maximum reference chain length. If node A references B which references C, the chain length is 2. Limits recursive lookups during decoding. Typical value: 3. |
minintervallength | integer | Minimum run length for interval encoding. Contiguous runs of successor IDs shorter than this threshold are not encoded as intervals. Set to 0 to disable interval encoding entirely. Typical value: 4. |
zetak | integer | The k parameter for zeta coding (ζ_k). Controls the “shrinking factor” that tunes the code for power-law distributions. Typical value: 3. |
compressionflags | string | Pipe-delimited list of coding overrides for each component (e.g., RESIDUALS_ZETA \| OUTDEGREES_GAMMA). When empty, default codings apply. |
Default Coding Assignments
When compressionflags is empty, the following defaults are used:
| Component | Default Coding |
|---|---|
| Outdegrees | Gamma (γ) |
| Blocks | Gamma (γ) |
| Block counts | Gamma (γ) |
| Residuals | Zeta (ζ_k) |
| References | Unary |
| Offsets | Gamma (γ) |
The compressionflags string can override any of these using the syntax COMPONENT_CODING, separated by |. For example: RESIDUALS_ZETA | BLOCKS_DELTA | OUTDEGREES_DELTA.
Integer Coding Schemes
BVGraph uses several variable-length integer codes to represent non-negative integers compactly. Smaller values receive shorter bit representations. All codes are read MSB-first (most significant bit first) from the bitstream.
Unary Code
Encodes value x as x zero bits followed by a single 1 bit.
| Value | Encoding |
|---|---|
| 0 | 1 |
| 1 | 01 |
| 2 | 001 |
| 3 | 0001 |
| x | x zeros + 1 |
Cost: x + 1 bits. Optimal for geometric distributions. Used primarily for reference values.
Gamma Code (γ) — Elias Gamma
Encodes x + 1 by writing the bit length of x + 1 in unary, followed by the value of x + 1 without its leading 1-bit.
- Compute ℓ = ⌊log₂(x + 1)⌋
- Write ℓ in unary (ℓ zeros + one 1-bit)
- Write the lower ℓ bits of x + 1
| Value | ℓ | Encoding |
|---|---|---|
| 0 | 0 | 1 |
| 1 | 1 | 01 0 |
| 2 | 1 | 01 1 |
| 3 | 2 | 001 00 |
| 6 | 2 | 001 11 |
| 7 | 3 | 0001 000 |
Cost: 2⌊log₂(x + 1)⌋ + 1 bits. Default coding for outdegrees, blocks, and block counts.
Delta Code (δ) — Elias Delta
Like gamma, but the length prefix itself is gamma-coded rather than unary-coded. This is more efficient for large values.
- Compute ℓ = ⌊log₂(x + 1)⌋
- Write ℓ using gamma coding
- Write the lower ℓ bits of x + 1
Cost: ⌊log₂(ℓ)⌋ + 2⌊log₂(ℓ + 1)⌋ + 1 bits. More efficient than gamma for values larger than ~32.
Zeta Code (ζ_k) — Parameterized Code
A family of codes parameterized by a positive integer k (the “shrinking factor”), designed by Boldi and Vigna specifically for power-law distributed gap values in web graphs. The parameter k controls how the code trades off between small and large values.
Encoding of value x with parameter k:
- Compute h = ⌊log₂(x + 1) / k⌋ (the “quotient”)
- Write h in unary
- Compute m = x + 1 − 2^(hk) (the “remainder”)
- If m < 2^(hk), write m in hk + k − 1 bits
- Otherwise, write 2m or 2m + 1 in hk + k bits
With k = 3 (the typical value, called ζ₃), this is optimized for the power-law distributions commonly found in web graph successor gaps. Default coding for residual values.
Nibble Code
A variable-length code using 4-bit chunks. Each chunk consists of:
- 1 continuation bit (1 = more chunks follow, 0 = last chunk)
- 3 data bits
Chunks are read in sequence. The 3-bit values from each chunk are concatenated (least-significant first) to form the final value.
| Value | Encoding |
|---|---|
| 0 | 0 000 |
| 7 | 0 111 |
| 8 | 1 000 · 0 001 |
Successor List Encoding
Each node’s successor list is encoded as a sequence of fields in the .graph bitstream. The fields appear in the following order:
1. Outdegree
The number of successors (outgoing edges) for this node, encoded using the configured outdegree coding (default: gamma).
outdegree = readGamma(bitstream)
If the outdegree is 0, no further fields are read for this node.
2. Reference
If windowsize > 0, the next field is a reference offset r encoded using the configured reference coding (default: unary). This indicates that the successor list of node (x − r) should be used as a template.
- r = 0: no reference; the successor list is encoded independently.
- r > 0: reference node (x − r); copy blocks and differences follow.
ref = readUnary(bitstream) // 0 means no reference
3. Copy Blocks
When r > 0, a block count is read (gamma-coded), followed by a sequence of block lengths. These blocks describe which elements from the referenced node’s successor list to copy and which to skip, in alternating fashion:
- Block 0: number of elements to copy from the start of the referenced list
- Block 1: number of elements to skip
- Block 2: number of elements to copy
- Block 3: number of elements to skip
- …and so on, alternating copy/skip
blockCount = readGamma(bitstream)
blocks[0] = readGamma(bitstream) // first block: raw value
blocks[i] = readGamma(bitstream) + 1 // subsequent blocks: +1 offset
If the block count is even, all remaining elements from the referenced list (past the last explicit block) are implicitly copied. If odd, remaining elements are implicitly skipped.
4. Intervals
If minintervallength > 0 and there are extra successors (beyond those copied from the reference), the decoder reads an interval count (always gamma-coded):
- Interval count = 0: no intervals present.
- Interval count > 0: that many (left, length) pairs follow.
For the first interval:
- left₀ is encoded as a signed offset from x using zigzag encoding:
left₀ = nat2int(readGamma()) + x - length₀ =
readGamma() + minintervallength
For subsequent intervals (i ≥ 1):
- leftᵢ =
readGamma() + prev_end + 1(gap from the end of the previous interval) - lengthᵢ =
readGamma() + minintervallength
Each interval represents the contiguous range of successor IDs: [left, left + length).
5. Residuals
Any remaining successors that are neither copied from the reference nor covered by intervals are encoded as residuals using gap coding:
- residual₀: signed offset from x using zigzag encoding:
residual₀ = x + nat2int(readZeta(k)) - residualᵢ (for i ≥ 1):
residualᵢ₋₁ + readZeta(k) + 1(strictly increasing gaps)
Residuals are encoded using the configured residual coding (default: zeta with parameter k).
6. Three-Way Merge
The final successor list for node x is the sorted merge of three sources:
- Block-copied elements from the referenced node’s list
- Interval elements expanded from the (left, length) pairs
- Residual elements decoded from gap-coded values
By construction, each source produces values in sorted order, so a standard three-way merge yields the complete sorted successor list.
Signed Integer Encoding (nat2int)
Some values (interval start offsets and first residuals) are encoded as signed differences from the current node ID. BVGraph uses zigzag encoding to map signed integers to non-negative integers:
| Natural | Integer |
|---|---|
| 0 | 0 |
| 1 | −1 |
| 2 | 1 |
| 3 | −2 |
| 4 | 2 |
The decoding formula is: nat2int(x) = (x >> 1) ^ -(x & 1)
This allows both positive and negative differences to be encoded efficiently using codes designed for non-negative integers.
Parallel Decompression
The .offsets file stores the bit position of each node’s entry in the .graph file as delta-coded values. Once loaded, these offsets enable random access to any node and multi-threaded parallel decoding.
How It Works
- Load offsets: Read the
.offsetsfile to build an array mapping each node ID to its bit position in the.graphfile. - Partition nodes: Divide the node range
[0, N)into chunks, one per thread. - Warm up reference windows: Each thread must decode
windowSizenodes before its assigned chunk start. This “warm-up” phase populates the thread’s local reference window so that references within the chunk can be correctly resolved. - Decode independently: Each thread decodes its chunk using its own
InputBitStreamand cyclic reference window, writing results into pre-allocated output arrays at computed offsets.
Reference Window
The decoder maintains a cyclic buffer of size windowSize + 1 that stores the most recently decoded successor lists. When a node references node (x − r), the decoder looks up that node’s successor list in the cyclic buffer. This means:
- Sequential decoding naturally maintains the window.
- For parallel decoding, each thread must warm up its window before decoding its first assigned node.
- The warm-up cost is at most
windowSizeextra node decodings per thread (typically 7 nodes).
Example Datasets
The LAW dataset collection hosts hundreds of web graphs and social networks in BVGraph format. Here are some commonly used datasets:
| Dataset | Nodes | Edges | Compressed Size |
|---|---|---|---|
| cnr-2000 | 325K | 3.2M | ~1 MB |
| eu-2005 | 862K | 19M | ~9 MB |
| in-2004 | 1.4M | 17M | ~8 MB |
| indochina-2004 | 7.4M | 194M | ~34 MB |
| uk-2014 | 788M | 47B | ~26 GB |
Downloading
Datasets follow a consistent URL pattern:
BASE_URL="http://data.law.di.unimi.it/webdata"
GRAPH="eu-2005"
# Download the required files
wget "$BASE_URL/$GRAPH/$GRAPH.graph"
wget "$BASE_URL/$GRAPH/$GRAPH.properties"
# The .offsets file can be downloaded if available, or generated:
./bvgraph_gen_offsets $GRAPH
References
- P. Boldi, S. Vigna. “The WebGraph Framework I: Compression Techniques.” WWW 2004. Paper
- WebGraph Java library — the reference implementation
- DSI Utilities —
InputBitStreamand integer coding implementations - LAW dataset collection — hundreds of BVGraph datasets
- C++ BVGraph decoder — pure C++ implementation with parallel decompression