rocksdict

Abstract

This package enables users to store, query, and delete a large number of key-value pairs on disk.

This is especially useful when the data cannot fit into RAM. If you have hundreds of GBs or many TBs of key-value data to store and query from, this is the package for you.

Installation

This package is built for macOS (amd64/arm64), Windows, and Linux (amd64/arm64). It can be installed from PyPI with pip install rocksdict.

Introduction

Below is a code example that shows how to do the following:

  • Create Rdict
  • Store something on disk
  • Close Rdict
  • Open Rdict again
  • Check Rdict elements
  • Iterate from Rdict
  • Batch get
  • Delete storage
Examples:

::

from rocksdict import Rdict, Options

path = str("./test_dict")

# create a Rdict with default options at `path`
db = Rdict(path)

# storing numbers
db[1.0] = 1
db[1] = 1.0
db["huge integer"] = 2343546543243564534233536434567543
db["good"] = True
db["bad"] = False
db["bytes"] = b"bytes"
db["this is a list"] = [1, 2, 3]
db["store a dict"] = {0: 1}

# for example numpy array
import numpy as np
import pandas as pd
db[b"numpy"] = np.array([1, 2, 3])
db["a table"] = pd.DataFrame({"a": [1, 2], "b": [2, 1]})

# close Rdict
db.close()

# reopen Rdict from disk
db = Rdict(path)
assert db[1.0] == 1
assert db[1] == 1.0
assert db["huge integer"] == 2343546543243564534233536434567543
assert db["good"] == True
assert db["bad"] == False
assert db["bytes"] == b"bytes"
assert db["this is a list"] == [1, 2, 3]
assert db["store a dict"] == {0: 1}
assert np.all(db[b"numpy"] == np.array([1, 2, 3]))
assert np.all(db["a table"] == pd.DataFrame({"a": [1, 2], "b": [2, 1]}))

# iterate through all elements
for k, v in db.items():
    print(f"{k} -> {v}")

# batch get:
print(db[["good", "bad", 1.0]])
# [True, False, 1]

# delete Rdict from disk
db.close()
Rdict.destroy(path)

Supported types:

  • key: int, float, bool, str, bytes
  • value: int, float, bool, str, bytes and anything that supports pickle.
from .rocksdict import *

__doc__ = rocksdict.__doc__

__all__ = ["Rdict",
           "WriteBatch",
           "SstFileWriter",
           "AccessType",
           "WriteOptions",
           "Snapshot",
           "RdictIter",
           "Options",
           "ReadOptions",
           "ColumnFamily",
           "IngestExternalFileOptions",
           "DBPath",
           "MemtableFactory",
           "BlockBasedOptions",
           "PlainTableFactoryOptions",
           "CuckooTableOptions",
           "UniversalCompactOptions",
           "UniversalCompactionStopStyle",
           "SliceTransform",
           "DataBlockIndexType",
           "BlockBasedIndexType",
           "Cache",
           "ChecksumType",
           "DBCompactionStyle",
           "DBCompressionType",
           "DBRecoveryMode",
           "Env",
           "FifoCompactOptions",
           "CompactOptions",
           "BottommostLevelCompaction",
           "KeyEncodingType"]

Rdict.__enter__ = lambda self: self
Rdict.__exit__ = lambda self, exc_type, exc_val, exc_tb: self.close()
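Since __enter__ and __exit__ are attached above, an Rdict can also be used as a context manager, which closes the database when the block exits:

::

from rocksdict import Rdict

path = "./ctx_example"

with Rdict(path) as db:
    db["key"] = "value"

# the database is closed here; reopen to read
with Rdict(path) as db:
    assert db["key"] == "value"

Rdict.destroy(path)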
class Rdict:

A persistent on-disk dictionary. Supports int, float, bool, str, and bytes as keys and values (values may also be any picklable object).

Example:

::

from rocksdict import Rdict

db = Rdict("./test_dir")
db[0] = 1

db = None
db = Rdict("./test_dir")
assert(db[0] == 1)
Arguments:
  • path (str): path to the database
  • options (Options): Options object
  • column_families (dict): (name, options) pairs; these Options must have the same raw_mode argument as the main Options. A column family called 'default' is always created.
  • access_type (AccessType): there are four access types: ReadWrite, ReadOnly, WithTTL, and Secondary; use the AccessType class to create one.
Rdict()
def set_dumps(self, /, dumps):

set custom dumps function

def set_loads(self, /, loads):

set custom loads function
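A minimal sketch of plugging in custom value serialization (the directory name is illustrative):

::

from rocksdict import Rdict
import pickle

path = "./custom_serde_example"
db = Rdict(path)

# use pickle with the highest protocol for value serialization
db.set_dumps(lambda obj: pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL))
db.set_loads(pickle.loads)

db["key"] = {"nested": [1, 2, 3]}
assert db["key"] == {"nested": [1, 2, 3]}

db.close()
Rdict.destroy(path)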

def set_write_options(self, /, write_opt):

Optionally disable WAL or sync for this write.

Example:

::

from rocksdict import Rdict, Options, WriteBatch, WriteOptions

path = "_path_for_rocksdb_storageY1"
db = Rdict(path)

# set write options
write_options = WriteOptions()
write_options.set_sync(False)
write_options.disable_wal(True)
db.set_write_options(write_options)

# write to db
db["my key"] = "my value"
db["key2"] = "value2"
db["key3"] = "value3"

# remove db
del db
Rdict.destroy(path)
def set_read_options(self, /, read_opt):

Configure Read Options for all the get operations.

def get(self, /, key, default=None, read_opt=None):

Get value from key or a list of keys.

Arguments:
  • key: a single key or list of keys.
  • default: the default value to return if key not found.
  • read_opt: override preset read options (or use Rdict.set_read_options to preset a read options used by default).
Returns:

None or default value if the key does not exist.
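A sketch of get with a default value and with a list of keys, assuming missing keys in a batch get come back as the default (None here):

::

from rocksdict import Rdict

path = "./get_example"
db = Rdict(path)
db["a"] = 1
db["b"] = 2

assert db.get("a") == 1
assert db.get("missing") is None
assert db.get("missing", default=0) == 0

# a list of keys returns a list of values in the same order
assert db.get(["a", "b", "missing"]) == [1, 2, None]

db.close()
Rdict.destroy(path)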

def put(self, /, key, value, write_opt=None):

Insert key value into database.

Arguments:
  • key: the key.
  • value: the value.
  • write_opt: override preset write options (or use Rdict.set_write_options to preset a write options used by default).
def key_may_exist(self, /, key, fetch=False, read_opt=None):

Check if a key may exist without doing any IO.

Notes:

If the key definitely does not exist in the database, then this method returns False, else True. If the caller wants to obtain value when the key is found in memory, fetch should be set to True. This check is potentially lighter-weight than invoking DB::get(). One way to make this lighter weight is to avoid doing any IOs.

The API follows the following principle:

  • True, and value found => the key must exist.
  • True => the key may or may not exist.
  • False => the key definitely does not exist.

Flip it around:

  • key exists => must return True, but value may or may not be found.
  • key doesn't exist => might still return True.
Arguments:
  • key: Key to check
  • read_opt: ReadOptions
Returns:

If fetch = False, returning True implies that the key may exist, while returning False implies that the key definitely does not exist. If fetch = True:

  • (True, value): the key is found and definitely exists.
  • (False, None): the key definitely does not exist.
  • (True, None): the key may exist.
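For illustration, a minimal sketch of both call forms (the path is arbitrary):

::

from rocksdict import Rdict

path = "./kme_example"
db = Rdict(path)
db["present"] = 1

# without fetch: a single bool; False means the key definitely does not exist
if not db.key_may_exist("some other key"):
    print("definitely not in the database")

# with fetch=True: a (bool, value-or-None) pair
found, value = db.key_may_exist("present", fetch=True)
if found and value is not None:
    assert value == 1  # found in memory, value returned

db.close()
Rdict.destroy(path)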

def delete(self, /, key, write_opt=None):

Delete entry from the database.

Arguments:
  • key: the key.
  • write_opt: override preset write options (or use Rdict.set_write_options to preset a write options used by default).
def iter(self, /, read_opt=None):

Returns a Reversible for iterating over keys and values.

Examples:

::

from rocksdict import Rdict, Options, ReadOptions

path = "_path_for_rocksdb_storage5"
db = Rdict(path)

for i in range(50):
    db[i] = i ** 2

iter = db.iter()

iter.seek_to_first()

j = 0
while iter.valid():
    assert iter.key() == j
    assert iter.value() == j ** 2
    print(f"{iter.key()} {iter.value()}")
    iter.next()
    j += 1

iter.seek_to_first()
assert iter.key() == 0
assert iter.value() == 0
print(f"{iter.key()} {iter.value()}")

iter.seek(25)
assert iter.key() == 25
assert iter.value() == 625
print(f"{iter.key()} {iter.value()}")

del iter, db
Rdict.destroy(path)
Arguments:
  • read_opt: ReadOptions

Returns: Reversible

def items(self, /, backwards=False, from_key=None, read_opt=None):

Iterate through all keys and values pairs.

Examples:

::

for k, v in db.items():
    print(f"{k} -> {v}")
Arguments:
  • backwards: iteration direction, forward if False.
  • from_key: iterate from key, first seek to this key or the nearest next key for iteration (depending on iteration direction).
  • read_opt: ReadOptions
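For instance, a short sketch of from_key and backwards; as in the iter() example above, small non-negative integer keys iterate in numeric order:

::

from rocksdict import Rdict

path = "./items_example"
db = Rdict(path)
for i in range(10):
    db[i] = i * i

# forward iteration starting at key 5: 5, 6, ..., 9
print([k for k, _ in db.items(from_key=5)])

# backward iteration starting at key 5: 5, 4, ..., 0
print([k for k, _ in db.items(backwards=True, from_key=5)])

db.close()
Rdict.destroy(path)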
def keys(self, /, backwards=False, from_key=None, read_opt=None):

Iterate through all keys

Examples:

::

all_keys = [k for k in db.keys()]
Arguments:
  • backwards: iteration direction, forward if False.
  • from_key: iterate from key, first seek to this key or the nearest next key for iteration (depending on iteration direction).
  • read_opt: ReadOptions
def values(self, /, backwards=False, from_key=None, read_opt=None):

Iterate through all values.

Examples:

::

all_values = [v for v in db.values()]
Arguments:
  • backwards: iteration direction, forward if False.
  • from_key: iterate from key, first seek to this key or the nearest next key for iteration (depending on iteration direction).
  • read_opt: ReadOptions, must have the same raw_mode argument.
def flush(self, /, wait=True):

Manually flush the current column family.

Notes:

Manually call mem-table flush. It is recommended to call flush() or close() before stopping the python program, to ensure that all written key-value pairs have been flushed to the disk.

Arguments:
  • wait (bool): whether to wait for the flush to finish.
def flush_wal(self, /, sync=True):

Flushes the WAL buffer. If sync is set to true, also syncs the data to disk.

def create_column_family(self, /, name, options=Ellipsis):

Creates column family with given name and options.

Arguments:
  • name: name of this column family
  • options: Rdict Options for this column family
Return:

the newly created column family
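A brief sketch of creating and using a column family (the name and path are illustrative):

::

from rocksdict import Rdict

path = "./cf_example"
db = Rdict(path)

# the returned column family behaves like its own Rdict
logs = db.create_column_family("logs")
logs["key"] = "value"
assert logs["key"] == "value"

# it can be retrieved again later with db.get_column_family("logs")
logs.close()
db.close()
Rdict.destroy(path)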

def drop_column_family(self, /, name):

Drops the column family with the given name

def get_column_family(self, /, name):

Get a column family Rdict

Arguments:
  • name: name of this column family
  • options: Rdict Options for this column family
Return:

the column family Rdict of this name

def get_column_family_handle(self, /, name):

Use this method to obtain a ColumnFamily instance, which can be used in WriteBatch.

Example:

::

wb = WriteBatch()
for i in range(100):
    wb.put(i, i**2, db.get_column_family_handle(cf_name_1))
db.write(wb)

wb = WriteBatch()
wb.set_default_column_family(db.get_column_family_handle(cf_name_2))
for i in range(100, 200):
    wb[i] = i**2
db.write(wb)
def snapshot(self, /):

A snapshot of the current column family.

Examples:

::

from rocksdict import Rdict

db = Rdict("tmp")
for i in range(100):
    db[i] = i

# take a snapshot
snapshot = db.snapshot()

for i in range(90):
    del db[i]

# 0-89 are no longer in db
for k, v in db.items():
    print(f"{k} -> {v}")

# but they are still in the snapshot
for i in range(100):
    assert snapshot[i] == i

# drop the snapshot
del snapshot, db

Rdict.destroy("tmp")
def ingest_external_file(self, /, paths, opts=Ellipsis):

Loads a list of external SST files created with SstFileWriter into the current column family.

Arguments:
  • paths: a list a paths
  • opts: IngestExternalFileOptionsPy instance
def try_catch_up_with_primary(self, /):

Tries to catch up with the primary by reading as much as possible from the log files.

def cancel_all_background(self, /, wait):

Request stopping background work, if wait is true wait until it's done.

def write(self, /, write_batch, write_opt=None):

Ingest a WriteBatch into the database.

Notes:

This WriteBatch does not write to the current column family.

Arguments:
  • write_batch: WriteBatch instance. This instance will be consumed.
  • write_opt: use default value if not provided.
def delete_range(self, /, begin, end, write_opt=None):

Removes the database entries in the range ["begin", "end") of the current column family.

Arguments:
  • begin: included
  • end: excluded
  • write_opt: WriteOptions
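For example (begin is inclusive, end is exclusive):

::

from rocksdict import Rdict

path = "./delete_range_example"
db = Rdict(path)
for i in range(10):
    db[i] = i

# removes keys 2, 3, ..., 7; key 8 is excluded
db.delete_range(2, 8)

assert db.get(1) == 1
assert db.get(2) is None
assert db.get(7) is None
assert db.get(8) == 8

db.close()
Rdict.destroy(path)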
def close(self, /):

Flush memory to disk, and drop the current column family.

Notes:

Calling db.close() is nearly equivalent to first calling db.flush() and then del db. However, db.close() does not guarantee the underlying RocksDB to be actually closed. Other Column Family Rdict instances, ColumnFamily (cf handle) instances, and iterator instances such as RdictIter, RdictItems, RdictKeys, RdictValues can all keep RocksDB alive. del or close all associated instances mentioned above to actually shut down RocksDB.

def path(self, /):

Return current database path.

def compact_range(self, /, begin, end, compact_opt=Ellipsis):

Runs a manual compaction on the Range of keys given for the current Column Family.
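A hedged sketch; passing None for both bounds is assumed here to mean the whole key range:

::

from rocksdict import Rdict

path = "./compact_example"
db = Rdict(path)
for i in range(10000):
    db[i] = i

# manually compact everything in the current column family
db.compact_range(None, None)

db.close()
Rdict.destroy(path)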

def set_options(self, /, options):

Set options for the current column family.

def property_value(self, /, name):

Retrieves a RocksDB property by name, for the current column family.

def property_int_value(self, /, name):

Retrieves a RocksDB property and casts it to an integer (for the current column family).

The full list of properties that return int values can be found here.
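For example, using standard RocksDB property names:

::

from rocksdict import Rdict

path = "./prop_example"
db = Rdict(path)
for i in range(100):
    db[i] = i

# integer-valued property
print(db.property_int_value("rocksdb.estimate-num-keys"))

# string-valued property (a multi-line stats report)
print(db.property_value("rocksdb.stats"))

db.close()
Rdict.destroy(path)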

def latest_sequence_number(self, /):

The sequence number of the most recent transaction.

def live_files(self, /):

Returns a list of all table files with their level, start key and end key

def destroy(path, options=Ellipsis):

Delete the database.

Arguments:
  • path (str): path to this database
  • options (rocksdict.Options): Rocksdb options object
def repair(path, options=Ellipsis):

Repair the database.

Arguments:
  • path (str): path to this database
  • options (rocksdict.Options): Rocksdb options object
def list_cf(path, options=Ellipsis):

List the column families in the database at the given path.

class WriteBatch:

WriteBatch class. Use db.write() to ingest WriteBatch.

Notes:

A WriteBatch instance can only be ingested once, otherwise an Exception will be raised.

Arguments:
  • raw_mode (bool): make sure that this is consistent with the Rdict.
WriteBatch()
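A minimal sketch of building and ingesting a batch against the default column family (the path is arbitrary):

::

from rocksdict import Rdict, WriteBatch

path = "./wb_example"
db = Rdict(path)

wb = WriteBatch()
for i in range(100):
    wb[i] = i * i  # item syntax targets the default column family
wb.delete(0)

assert not wb.is_empty()
print(wb.len(), wb.size_in_bytes())

# the batch is consumed by db.write()
db.write(wb)
assert db.get(0) is None
assert db[10] == 100

db.close()
Rdict.destroy(path)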
def set_dumps(self, /, dumps):

change to a custom dumps function

def set_default_column_family(self, /, column_family=None):

Set the default column family used by the a[i] = j and del a[i] syntax.

You can also use put(key, value, column_family) to explicitly choose column family.

Arguments:
  • column_family (ColumnFamily | None): column family descriptor or None (for the default family).
def len(self, /):

length of the batch

def size_in_bytes(self, /):

Return WriteBatch serialized size (in bytes).

def is_empty(self, /):

Check whether the batch is empty.

def put(self, /, key, value, column_family=None):

Insert a value into the database under the given key.

Arguments:
  • column_family: override the default column family set by set_default_column_family
def delete(self, /, key, column_family=None):

Removes the database entry for key. Does nothing if the key was not found.

Arguments:
  • column_family: override the default column family set by set_default_column_family
def delete_range(self, /, begin, end, column_family=None):

Remove database entries in column family from start key to end key.

Notes:

Removes the database entries in the range ["begin_key", "end_key"), i.e., including "begin_key" and excluding "end_key". It is not an error if no keys exist in the range ["begin_key", "end_key").

Arguments:
  • begin: begin key
  • end: end key
  • column_family: override the default column family set by set_default_column_family
def clear(self, /):

Clear all updates buffered in this batch.

class SstFileWriter:

SstFileWriter is used to create SST files that can be added to the database later. All keys in files generated by SstFileWriter will have sequence number = 0.

Arguments:
  • options: these Options must have the same raw_mode as the Rdict DB.
SstFileWriter()
def set_dumps(self, /, dumps):

set custom dumps function

def open(self, /, path):

Prepare SstFileWriter to write into the file located at the given path.

def finish(self, /):

Finalize writing to sst file and close file.

def file_size(self, /):

returns the current file size
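A hedged end-to-end sketch of writing an SST file and ingesting it with Rdict.ingest_external_file; it assumes the writer supports writer[key] = value item assignment and that keys are inserted in ascending order, and the file names are illustrative:

::

from rocksdict import Rdict, Options, SstFileWriter

# write keys in ascending order into an external SST file
writer = SstFileWriter(options=Options())
writer.open("./example.sst")
for i in range(1000):
    writer[i] = i * i
writer.finish()

# ingest the finished file into a database
path = "./sst_example"
db = Rdict(path)
db.ingest_external_file(["./example.sst"])
assert db[999] == 999 * 999

db.close()
Rdict.destroy(path)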

class AccessType:

Define DB Access Types.

Notes:

There are four access types:

  • ReadWrite: default value
  • ReadOnly
  • WithTTL
  • Secondary
Examples:

::

from rocksdict import Rdict, AccessType

# open with 24 hours ttl
db = Rdict("./main_path", access_type = AccessType.with_ttl(24 * 3600))

# open as read_only
db = Rdict("./main_path", access_type = AccessType.read_only())

# open as secondary
db = Rdict("./main_path", access_type = AccessType.secondary("./secondary_path"))
AccessType()
def read_write():

Define DB Access Types.

Notes:

There are four access types:

  • ReadWrite: default value
  • ReadOnly
  • WithTTL
  • Secondary
Examples:

::

from rocksdict import Rdict, AccessType

# open with 24 hours ttl
db = Rdict("./main_path", access_type = AccessType.with_ttl(24 * 3600))

# open as read_only
db = Rdict("./main_path", access_type = AccessType.read_only())

# open as secondary
db = Rdict("./main_path", access_type = AccessType.secondary("./secondary_path"))
def read_only(error_if_log_file_exist=False):

Define DB Access Types.

Notes:

There are four access types:

  • ReadWrite: default value
  • ReadOnly
  • WithTTL
  • Secondary
Examples:

::

from rocksdict import Rdict, AccessType

# open with 24 hours ttl
db = Rdict("./main_path", access_type = AccessType.with_ttl(24 * 3600))

# open as read_only
db = Rdict("./main_path", access_type = AccessType.read_only())

# open as secondary
db = Rdict("./main_path", access_type = AccessType.secondary("./secondary_path"))
def secondary(secondary_path):

Define DB Access Types.

Notes:

There are four access types:

  • ReadWrite: default value
  • ReadOnly
  • WithTTL
  • Secondary
Examples:

::

from rocksdict import Rdict, AccessType

# open with 24 hours ttl
db = Rdict("./main_path", access_type = AccessType.with_ttl(24 * 3600))

# open as read_only
db = Rdict("./main_path", access_type = AccessType.read_only())

# open as secondary
db = Rdict("./main_path", access_type = AccessType.secondary("./secondary_path"))
def with_ttl(duration):

Define DB Access Types.

Notes:

There are four access types:

  • ReadWrite: default value
  • ReadOnly
  • WithTTL
  • Secondary
Examples:

::

from rocksdict import Rdict, AccessType

# open with 24 hours ttl
db = Rdict("./main_path", access_type = AccessType.with_ttl(24 * 3600))

# open as read_only
db = Rdict("./main_path", access_type = AccessType.read_only())

# open as secondary
db = Rdict("./main_path", access_type = AccessType.secondary("./secondary_path"))
class WriteOptions:

Optionally disable WAL or sync for this write.

Example:

::

from rocksdict import Rdict, Options, WriteBatch, WriteOptions

path = "_path_for_rocksdb_storageY1"
db = Rdict(path, Options())

# set write options
write_options = WriteOptions()
write_options.set_sync(False)
write_options.disable_wal(True)
db.set_write_options(write_options)

# write to db
db["my key"] = "my value"
db["key2"] = "value2"
db["key3"] = "value3"

# remove db
del db
Rdict.destroy(path, Options())
WriteOptions()
sync

Sets the sync mode. If true, the write will be flushed from the operating system buffer cache before the write is considered complete. If this flag is true, writes will be slower.

Default: false

ignore_missing_column_families

If true and if user is trying to write to column families that don't exist (they were dropped), ignore the write (don't return an error). If there are multiple writes in a WriteBatch, other writes will succeed.

Default: false

low_pri

If true, this write request is of lower priority if compaction is behind. In this case, if no_slowdown is also true, the request will be cancelled immediately with Status::Incomplete() returned. Otherwise, it will be slowed down. The slowdown value is determined by RocksDB to guarantee it introduces minimum impacts to high priority writes.

Default: false

disable_wal

Sets whether WAL should be active or not. If true, writes will not first go to the write ahead log, and the write may be lost after a crash.

Default: false

no_slowdown

If true and we need to wait or sleep for the write request, fails immediately with Status::Incomplete().

Default: false

memtable_insert_hint_per_batch

If true, writebatch will maintain the last insert positions of each memtable as hints in concurrent write. It can improve write performance in concurrent writes if keys in one writebatch are sequential. In non-concurrent writes (when concurrent_memtable_writes is false) this option will be ignored.

Default: false

class Snapshot:

A consistent view of the database at the point of creation.

Examples:

::

from rocksdict import Rdict

db = Rdict("tmp")
for i in range(100):
    db[i] = i

# take a snapshot
snapshot = db.snapshot()

for i in range(90):
    del db[i]

# 0-89 are no longer in db
for k, v in db.items():
    print(f"{k} -> {v}")

# but they are still in the snapshot
for i in range(100):
    assert snapshot[i] == i

# drop the snapshot
del snapshot, db

Rdict.destroy("tmp")
Snapshot()
def iter(self, /, read_opt=None):

Creates an iterator over the data in this snapshot under the given column family, using the default read options.

Arguments:
  • read_opt: ReadOptions, must have the same raw_mode argument.
def items(self, /, backwards=False, from_key=None, read_opt=None):

Iterate through all keys and values pairs.

Arguments:
  • backwards: iteration direction, forward if False.
  • from_key: iterate from key, first seek to this key or the nearest next key for iteration (depending on iteration direction).
  • read_opt: ReadOptions, must have the same raw_mode argument.
def keys(self, /, backwards=False, from_key=None, read_opt=None):

Iterate through all keys.

Arguments:
  • backwards: iteration direction, forward if False.
  • from_key: iterate from key, first seek to this key or the nearest next key for iteration (depending on iteration direction).
  • read_opt: ReadOptions, must have the same raw_mode argument.
def values(self, /, backwards=False, from_key=None, read_opt=None):

Iterate through all values.

Arguments:
  • backwards: iteration direction, forward if False.
  • from_key: iterate from key, first seek to this key or the nearest next key for iteration (depending on iteration direction).
  • read_opt: ReadOptions, must have the same raw_mode argument.
class RdictIter:
RdictIter()
def valid(self, /):

Returns true if the iterator is valid. An iterator is invalidated when it reaches the end of its defined range, or when it encounters an error.

To check whether the iterator encountered an error after valid has returned false, use the status method. status will never return an error when valid is true.

def status(self, /):

Returns an error Result if the iterator has encountered an error during operation. When an error is encountered, the iterator is invalidated and valid will return false when called.

Performing a seek will discard the current status.

def seek_to_first(self, /):

Seeks to the first key in the database.

Example:

::

from rocksdict import Rdict, Options, ReadOptions

path = "_path_for_rocksdb_storage5"
db = Rdict(path, Options())
iter = db.iter(ReadOptions())

# Iterate all keys from the start in lexicographic order
iter.seek_to_first()

while iter.valid():
    print(f"{iter.key()} {iter.value()}")
    iter.next()

# Read just the first key
iter.seek_to_first()
print(f"{iter.key()} {iter.value()}")

del iter, db
Rdict.destroy(path, Options())
def seek_to_last(self, /):

Seeks to the last key in the database.

Example:

::

from rocksdict import Rdict, Options, ReadOptions

path = "_path_for_rocksdb_storage6"
db = Rdict(path, Options())
iter = db.iter(ReadOptions())

# Iterate all keys from the start in lexicographic order
iter.seek_to_last()

while iter.valid():
    print(f"{iter.key()} {iter.value()}")
    iter.prev()

# Read just the last key
iter.seek_to_last()
print(f"{iter.key()} {iter.value()}")

del iter, db
Rdict.destroy(path, Options())
def seek(self, /, key):

Seeks to the specified key or the first key that lexicographically follows it.

This method will attempt to seek to the specified key. If that key does not exist, it will find and seek to the key that lexicographically follows it instead.

Example:

::

from rocksdict import Rdict, Options, ReadOptions

path = "_path_for_rocksdb_storage6"
db = Rdict(path, Options())
iter = db.iter(ReadOptions())

# Read the first string key that starts with 'a'
iter.seek("a");
print(f"{iter.key()} {iter.value()}")

del iter, db
Rdict.destroy(path, Options())
def seek_for_prev(self, /, key):

Seeks to the specified key, or the first key that lexicographically precedes it.

Like .seek(), this method will attempt to seek to the specified key. The difference with .seek() is that if the specified key does not exist, this method will seek to the key that lexicographically precedes it instead.

Example:

::

from rocksdict import Rdict, Options, ReadOptions

path = "_path_for_rocksdb_storage6"
db = Rdict(path, Options())
iter = db.iter(ReadOptions())

# Read the last key that starts with 'a'
seek_for_prev("b")
print(f"{iter.key()} {iter.value()}")

del iter, db
Rdict.destroy(path, Options())
def next(self, /):

Seeks to the next key.

def prev(self, /):

Seeks to the previous key.

def key(self, /):

Returns the current key.

def value(self, /):

Returns the current value.

class Options:

Database-wide options around performance and behavior.

Please read the official tuning guide and, most importantly, measure performance under realistic workloads with realistic hardware.

Example:

::

from rocksdict import Options, Rdict, DBCompactionStyle

def badly_tuned_for_somebody_elses_disk():

    path = "path/for/rocksdb/storageX"

    opts = Options()
    opts.create_if_missing(True)
    opts.set_max_open_files(10000)
    opts.set_use_fsync(False)
    opts.set_bytes_per_sync(8388608)
    opts.optimize_for_point_lookup(1024)
    opts.set_table_cache_num_shard_bits(6)
    opts.set_max_write_buffer_number(32)
    opts.set_write_buffer_size(536870912)
    opts.set_target_file_size_base(1073741824)
    opts.set_min_write_buffer_number_to_merge(4)
    opts.set_level_zero_stop_writes_trigger(2000)
    opts.set_level_zero_slowdown_writes_trigger(0)
    opts.set_compaction_style(DBCompactionStyle.universal())
    opts.set_disable_auto_compactions(True)

    return Rdict(path, opts)
Arguments:
  • raw_mode (bool): set this to True to operate in raw mode (i.e. it will only allow bytes as key-value pairs, and is compatible with other RocksDB databases).
Options()
def load_latest(path, env=Ellipsis, ignore_unknown_options=False, cache=Ellipsis):

Load latest options from the rocksdb path

Returns a tuple, where the first item is Options and the second item is a Dict of column families.
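A sketch of reopening a database with the options it was last opened with (the directory name is illustrative, and the database is assumed to have been created by rocksdict in the default non-raw mode):

::

from rocksdict import Rdict, Options

path = "./load_latest_example"

# create a database so there is an OPTIONS file to load
db = Rdict(path)
db["key"] = "value"
db.close()

# load the latest Options plus the per-column-family Options
opts, cf_opts = Options.load_latest(path)
db = Rdict(path, options=opts, column_families=cf_opts)
assert db["key"] == "value"

db.close()
Rdict.destroy(path)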

def increase_parallelism(self, /, parallelism):

By default, RocksDB uses only one background thread for flush and compaction. Calling this function will set it up such that a total of parallelism threads is used. A good value for parallelism is the number of cores. You almost definitely want to call this function if your system is bottlenecked by RocksDB.

def optimize_level_style_compaction(self, /, memtable_memory_budget):

Optimize level style compaction.

Default values for some parameters in Options are not optimized for heavy workloads and big datasets, which means you might observe write stalls under some conditions.

This can be used as one of the starting points for tuning RocksDB options in such cases.

Internally, it sets write_buffer_size, min_write_buffer_number_to_merge, max_write_buffer_number, level0_file_num_compaction_trigger, target_file_size_base, max_bytes_for_level_base, so it may override those parameters if they were set before.

It sets buffer sizes so that memory consumption would be constrained by memtable_memory_budget.
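A sketch of a tuning starting point combining the two calls above (the numbers are illustrative, not recommendations):

::

from rocksdict import Options, Rdict

opts = Options()

# roughly one background thread per core
opts.increase_parallelism(8)

# budget about 512 MiB of memtable memory for level-style compaction
opts.optimize_level_style_compaction(512 * 1024 * 1024)

path = "./tuned_db"
db = Rdict(path, opts)
db.close()
Rdict.destroy(path)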

def optimize_universal_style_compaction(self, /, memtable_memory_budget):

Optimize universal style compaction.

Default values for some parameters in Options are not optimized for heavy workloads and big datasets, which means you might observe write stalls under some conditions.

This can be used as one of the starting points for tuning RocksDB options in such cases.

Internally, it sets write_buffer_size, min_write_buffer_number_to_merge, max_write_buffer_number, level0_file_num_compaction_trigger, target_file_size_base, max_bytes_for_level_base, so it may override those parameters if they were set before.

It sets buffer sizes so that memory consumption would be constrained by memtable_memory_budget.

def create_if_missing(self, /, create_if_missing):

If true, the database will be created if it is missing.

Default: true

def create_missing_column_families(self, /, create_missing_cfs):

If true, any column families that didn't exist when opening the database will be created.

Default: false

def set_error_if_exists(self, /, enabled):

Specifies whether an error should be raised if the database already exists.

Default: false

def set_paranoid_checks(self, /, enabled):

Enable/disable paranoid checks.

If true, the implementation will do aggressive checking of the data it is processing and will stop early if it detects any errors. This may have unforeseen ramifications: for example, a corruption of one DB entry may cause a large number of entries to become unreadable or for the entire DB to become unopenable. If any of the writes to the database fails (Put, Delete, Merge, Write), the database will switch to read-only mode and fail all other Write operations.

Default: false

def set_db_paths(self, /, paths):

A list of paths where SST files can be put into, with its target size. Newer data is placed into paths specified earlier in the vector while older data gradually moves to paths specified later in the vector.

For example, if you have a flash device with 10GB allocated for the DB, as well as a hard drive of 2TB, you should configure it as: [{"/flash_path", 10GB}, {"/hard_drive", 2TB}]

The system will try to guarantee data under each path is close to but not larger than the target size. But current and future file sizes used in determining where to place a file are based on best-effort estimation, which means there is a chance that the actual size under the directory is slightly more than the target size under some workloads. Users should leave some buffer room for those cases.

If none of the paths has sufficient room to place a file, the file will be placed in the last path anyway, regardless of the target size.

Placing newer data in earlier paths is also best-effort. Users should expect user files to be placed in higher levels in some extreme cases.

If left empty, only one path will be used, which is the path passed when opening the DB.

Default: empty

Example:

::

from rocksdict import Options, DBPath

opt = Options()
flash_path = DBPath("/flash_path", 10 * 1024 * 1024 * 1024) # 10 GB
hard_drive = DBPath("/hard_drive", 2 * 1024 * 1024 * 1024 * 1024) # 2 TB
opt.set_db_paths([flash_path, hard_drive])
def set_env(self, /, env):

Use the specified object to interact with the environment, e.g. to read/write files, schedule background work, etc. In the near future, support for doing storage operations such as read/write files through env will be deprecated in favor of file_system.

def set_compression_type(self, /, t):

Sets the compression algorithm that will be used for compressing blocks.

Default: DBCompressionType::Snappy (DBCompressionType::None if snappy feature is not enabled).

Example:

::

from rocksdict import Options, DBCompressionType

opts = Options()
opts.set_compression_type(DBCompressionType.snappy())
def set_compression_per_level(self, /, level_types):

Different levels can have different compression policies. There are cases where most lower levels would like to use quick compression algorithms while the higher levels (which have more data) use compression algorithms that have better compression but could be slower. This array, if non-empty, should have an entry for each level of the database; these override the value specified in the previous field 'compression'.

Example:

::

from rocksdict import Options, DBCompressionType

opts = Options()
opts.set_compression_per_level([
    DBCompressionType.none(),
    DBCompressionType.none(),
    DBCompressionType.snappy(),
    DBCompressionType.snappy(),
    DBCompressionType.snappy()
])
def set_compression_options(self, /, w_bits, level, strategy, max_dict_bytes):

Maximum size of dictionaries used to prime the compression library. Enabling dictionary can improve compression ratios when there are repetitions across data blocks.

The dictionary is created by sampling the SST file data. If zstd_max_train_bytes is nonzero, the samples are passed through zstd's dictionary generator. Otherwise, the random samples are used directly as the dictionary.

When compression dictionary is disabled, we compress and write each block before buffering data for the next one. When compression dictionary is enabled, we buffer all SST file data in-memory so we can sample it, as data can only be compressed and written after the dictionary has been finalized. So users of this feature may see increased memory usage.

Default: 0

def set_zstd_max_train_bytes(self, /, value):

Sets maximum size of training data passed to zstd's dictionary trainer. Using zstd's dictionary trainer can achieve even better compression ratio improvements than using max_dict_bytes alone.

The training data will be used to generate a dictionary of max_dict_bytes.

Default: 0.

def set_compaction_readahead_size(self, /, compaction_readahead_size):

If non-zero, we perform bigger reads when doing compaction. If you're running RocksDB on spinning disks, you should set this to at least 2MB. That way RocksDB's compaction is doing sequential instead of random reads.

When non-zero, we also force new_table_reader_for_compaction_inputs to true.

Default: 0

def set_level_compaction_dynamic_level_bytes(self, /, v):

Allow RocksDB to pick dynamic base of bytes for levels. With this feature turned on, RocksDB will automatically adjust max bytes for each level. The goal of this feature is to have lower bound on size amplification.

Default: false.

def set_prefix_extractor(self, /, prefix_extractor):
def optimize_for_point_lookup(self, /, cache_size):
def set_optimize_filters_for_hits(self, /, optimize_for_hits):

Sets the optimize_filters_for_hits flag

Default: false

def set_delete_obsolete_files_period_micros(self, /, micros):

Sets the periodicity when obsolete files get deleted.

The files that get out of scope by the compaction process will still get automatically deleted on every compaction, regardless of this setting.

Default: 6 hours

def prepare_for_bulk_load(self, /):

Prepare the DB for bulk loading.

All data will be in level 0 without any automatic compaction. It's recommended to manually call CompactRange(NULL, NULL) before reading from the database, because otherwise the read can be very slow.

def set_max_open_files(self, /, nfiles):

Sets the number of open files that can be used by the DB. You may need to increase this if your database has a large working set. Value -1 means files opened are always kept open. You can estimate number of files based on target_file_size_base and target_file_size_multiplier for level-based compaction. For universal-style compaction, you can usually set it to -1.

Default: -1

def set_max_file_opening_threads(self, /, nthreads):

If max_open_files is -1, DB will open all files on DB::Open(). You can use this option to increase the number of threads used to open the files. Default: 16

def set_use_fsync(self, /, useit):

If true, then every store to stable storage will issue a fsync. If false, then every store to stable storage will issue a fdatasync. This parameter should be set to true while storing data to filesystem like ext3 that can lose files after a reboot.

Default: false

def set_db_log_dir(self, /, path):

Specifies the absolute info LOG dir.

If it is empty, the log files will be in the same dir as data. If it is non empty, the log files will be in the specified dir, and the db data dir's absolute path will be used as the log file name's prefix.

Default: empty

def set_bytes_per_sync(self, /, nbytes):

Allows OS to incrementally sync files to disk while they are being written, asynchronously, in the background. This operation can be used to smooth out write I/Os over time. Users shouldn't rely on it for persistency guarantee. Issue one request for every bytes_per_sync written. 0 turns it off.

Default: 0

You may consider using rate_limiter to regulate the write rate to the device. When the rate limiter is enabled, it automatically sets bytes_per_sync to 1MB.

This option applies to table files.

def set_wal_bytes_per_sync(self, /, nbytes):

Same as bytes_per_sync, but applies to WAL files.

Default: 0, turned off

Dynamically changeable through SetDBOptions() API.

def set_writable_file_max_buffer_size(self, /, nbytes):

Sets the maximum buffer size that is used by WritableFileWriter.

On Windows, we need to maintain an aligned buffer for writes. We allow the buffer to grow until its size hits the limit in buffered IO, and fix the buffer size when using direct IO to ensure alignment of write requests if the logical sector size is unusual.

Default: 1024 * 1024 (1 MB)

Dynamically changeable through SetDBOptions() API.

def set_allow_concurrent_memtable_write(self, /, allow):

If true, allow multi-writers to update mem tables in parallel. Only some memtable_factory-s support concurrent writes; currently it is implemented only for SkipListFactory. Concurrent memtable writes are not compatible with inplace_update_support or filter_deletes. It is strongly recommended to set enable_write_thread_adaptive_yield if you are going to use this feature.

Default: true

def set_enable_write_thread_adaptive_yield(self, /, enabled):

If true, threads synchronizing with the write batch group leader will wait for up to write_thread_max_yield_usec before blocking on a mutex. This can substantially improve throughput for concurrent workloads, regardless of whether allow_concurrent_memtable_write is enabled.

Default: true

def set_max_sequential_skip_in_iterations(self, /, num):

Specifies whether an iteration->Next() sequentially skips over keys with the same user-key or not.

This number specifies the number of keys (with the same userkey) that will be sequentially skipped before a reseek is issued.

Default: 8

def set_use_direct_reads(self, /, enabled):

Enable direct I/O mode for reading. This may or may not improve performance depending on the use case.

Files will be opened in "direct I/O" mode which means that data read from the disk will not be cached or buffered. The hardware buffer of the devices may however still be used. Memory mapped files are not impacted by these parameters.

Default: false

def set_use_direct_io_for_flush_and_compaction(self, /, enabled):

Enable direct I/O mode for flush and compaction

Files will be opened in "direct I/O" mode which means that data written to the disk will not be cached or buffered. The hardware buffer of the devices may however still be used. Memory mapped files are not impacted by these parameters. they may or may not improve performance depending on the use case

Default: false

def set_is_fd_close_on_exec(self, /, enabled):

Enable/disable child processes inheriting open files.

Default: true

def set_table_cache_num_shard_bits(self, /, nbits):

Sets the number of shards used for table cache.

Default: 6

def set_target_file_size_multiplier(self, /, multiplier):

By default target_file_size_multiplier is 1, which means by default files in different levels will have similar size.

Dynamically changeable through SetOptions() API

def set_min_write_buffer_number(self, /, nbuf):

Sets the minimum number of write buffers that will be merged together before writing to storage. If set to 1, then all write buffers are flushed to L0 as individual files and this increases read amplification because a get request has to check in all of these files. Also, an in-memory merge may result in writing lesser data to storage if there are duplicate records in each of these individual write buffers.

Default: 1

def set_max_write_buffer_number(self, /, nbuf):

Sets the maximum number of write buffers that are built up in memory. The default and the minimum number is 2, so that when 1 write buffer is being flushed to storage, new writes can continue to the other write buffer. If max_write_buffer_number > 3, writing will be slowed down to options.delayed_write_rate if we are writing to the last write buffer allowed.

Default: 2

def set_write_buffer_size(self, /, size):

Sets the amount of data to build up in memory (backed by an unsorted log on disk) before converting to a sorted on-disk file.

Larger values increase performance, especially during bulk loads. Up to max_write_buffer_number write buffers may be held in memory at the same time, so you may wish to adjust this parameter to control memory usage. Also, a larger write buffer will result in a longer recovery time the next time the database is opened.

Note that write_buffer_size is enforced per column family. See db_write_buffer_size for sharing memory across column families.

Default: 0x4000000 (64MiB)

Dynamically changeable through SetOptions() API

def set_db_write_buffer_size(self, /, size):

Amount of data to build up in memtables across all column families before writing to disk.

This is distinct from write_buffer_size, which enforces a limit for a single memtable.

This feature is disabled by default. Specify a non-zero value to enable it.

Default: 0 (disabled)

def set_max_bytes_for_level_base(self, /, size):

Control maximum total data size for a level. max_bytes_for_level_base is the max total for level-1. Maximum number of bytes for level L can be calculated as (max_bytes_for_level_base) * (max_bytes_for_level_multiplier ^ (L-1)) For example, if max_bytes_for_level_base is 200MB, and if max_bytes_for_level_multiplier is 10, total data size for level-1 will be 200MB, total file size for level-2 will be 2GB, and total file size for level-3 will be 20GB.

Default: 0x10000000 (256MiB).

Dynamically changeable through SetOptions() API

def set_max_bytes_for_level_multiplier(self, /, mul):

Default: 10

def set_max_manifest_file_size(self, /, size):

The manifest file is rolled over on reaching this limit. The older manifest file will be deleted. The default value is MAX_INT so that roll-over does not take place.

def set_target_file_size_base(self, /, size):

Sets the target file size for compaction. target_file_size_base is per-file size for level-1. Target file size for level L can be calculated by target_file_size_base * (target_file_size_multiplier ^ (L-1)) For example, if target_file_size_base is 2MB and target_file_size_multiplier is 10, then each file on level-1 will be 2MB, and each file on level 2 will be 20MB, and each file on level-3 will be 200MB.

Default: 0x4000000 (64MiB)

Dynamically changeable through SetOptions() API

def set_min_write_buffer_number_to_merge(self, /, to_merge):

Sets the minimum number of write buffers that will be merged together before writing to storage. If set to 1, then all write buffers are flushed to L0 as individual files and this increases read amplification because a get request has to check in all of these files. Also, an in-memory merge may result in writing lesser data to storage if there are duplicate records in each of these individual write buffers.

Default: 1

def set_level_zero_file_num_compaction_trigger(self, /, n):

Sets the number of files to trigger level-0 compaction. A value < 0 means that level-0 compaction will not be triggered by number of files at all.

Default: 4

Dynamically changeable through SetOptions() API

def set_level_zero_slowdown_writes_trigger(self, /, n):

Sets the soft limit on number of level-0 files. We start slowing down writes at this point. A value < 0 means that no writing slow down will be triggered by number of files in level-0.

Default: 20

Dynamically changeable through SetOptions() API

def set_level_zero_stop_writes_trigger(self, /, n):

Sets the maximum number of level-0 files. We stop writes at this point.

Default: 24

Dynamically changeable through SetOptions() API

def set_compaction_style(self, /, style):

Sets the compaction style.

Default: DBCompactionStyle.level()

def set_universal_compaction_options(self, /, uco):

Sets the options needed to support Universal Style compactions.

def set_fifo_compaction_options(self, /, fco):

Sets the options for FIFO compaction style.

def set_unordered_write(self, /, unordered):

Setting unordered_write to true trades higher write throughput for relaxing the immutability guarantee of snapshots. This violates the repeatability one expects from ::Get from a snapshot, as well as ::MultiGet and Iterator's consistent-point-in-time view property. If the application cannot tolerate the relaxed guarantees, it can implement its own mechanisms to work around that and yet benefit from the higher throughput. Using TransactionDB with WRITE_PREPARED write policy and two_write_queues=true is one way to achieve immutable snapshots despite unordered_write.

By default, i.e., when it is false, rocksdb does not advance the sequence number for new snapshots unless all the writes with lower sequence numbers are already finished. This provides the immutability that we expect from snapshots. Moreover, since Iterator and MultiGet internally depend on snapshots, the snapshot immutability results in Iterator and MultiGet offering a consistent-point-in-time view. If set to true, although the Read-Your-Own-Write property is still provided, the snapshot immutability property is relaxed: the writes issued after the snapshot is obtained (with larger sequence numbers) will still not be visible to reads from that snapshot; however, there might still be pending writes (with lower sequence numbers) that will change the state visible to the snapshot after they land in the memtable.

Default: false

def set_max_subcompactions(self, /, num):

Sets maximum number of threads that will concurrently perform a compaction job by breaking it into multiple, smaller ones that are run simultaneously.

Default: 1 (i.e. no subcompactions)

def set_max_background_jobs(self, /, jobs):

Sets maximum number of concurrent background jobs (compactions and flushes).

Default: 2

Dynamically changeable through SetDBOptions() API.

def set_disable_auto_compactions(self, /, disable):

Disables automatic compactions. Manual compactions can still be issued on this column family

Default: false

Dynamically changeable through SetOptions() API

def set_memtable_huge_page_size(self, /, size):

SetMemtableHugePageSize sets the page size for huge pages for the arena used by the memtable. If <= 0, it won't allocate from huge pages but from malloc. Users are responsible for reserving huge pages for it to be allocated, for example: sysctl -w vm.nr_hugepages=20 (see the Linux doc Documentation/vm/hugetlbpage.txt). If there aren't enough free huge pages available, it will fall back to malloc.

Dynamically changeable through SetOptions() API

def set_max_successive_merges(self, /, num):

Sets the maximum number of successive merge operations on a key in the memtable.

When a merge operation is added to the memtable and the maximum number of successive merges is reached, the value of the key will be calculated and inserted into the memtable instead of the merge operation. This will ensure that there are never more than max_successive_merges merge operations in the memtable.

Default: 0 (disabled)

def set_bloom_locality(self, /, v):

Control locality of bloom filter probes to improve cache miss rate. This option only applies to memtable prefix bloom and plaintable prefix bloom. It essentially limits the max number of cache lines each bloom filter check can touch.

This optimization is turned off when set to 0. The number should never be greater than the number of probes. This option can boost performance for in-memory workloads but should be used with care since it can cause a higher false positive rate.

Default: 0

def set_inplace_update_support(self, /, enabled):

Enable/disable thread-safe inplace updates.

Requires updates if

  • key exists in current memtable
  • new sizeof(new_value) <= sizeof(old_value)
  • old_value for that key is a put i.e. kTypeValue

Default: false.

def set_inplace_update_locks(self, /, num):

Sets the number of locks used for inplace update.

Default: 10000 when inplace_update_support = true, otherwise 0.

def set_max_bytes_for_level_multiplier_additional(self, /, level_values):

Different max-size multipliers for different levels. These are multiplied by max_bytes_for_level_multiplier to arrive at the max-size of each level.

Default: 1

Dynamically changeable through SetOptions() API

def set_skip_checking_sst_file_sizes_on_db_open(self, /, value):

If true, then DB::Open() will not fetch and check sizes of all sst files. This may significantly speed up startup if there are many sst files, especially when using non-default Env with expensive GetFileSize(). We'll still check that all required sst files exist. If paranoid_checks is false, this option is ignored, and sst files are not checked at all.

Default: false

def set_max_write_buffer_size_to_maintain(self, /, size):

The total maximum size(bytes) of write buffers to maintain in memory including copies of buffers that have already been flushed. This parameter only affects trimming of flushed buffers and does not affect flushing. This controls the maximum amount of write history that will be available in memory for conflict checking when Transactions are used. The actual size of write history (flushed Memtables) might be higher than this limit if further trimming will reduce write history total size below this limit. For example, if max_write_buffer_size_to_maintain is set to 64MB, and there are three flushed Memtables, with sizes of 32MB, 20MB, 20MB. Because trimming the next Memtable of size 20MB will reduce total memory usage to 52MB which is below the limit, RocksDB will stop trimming.

When using an OptimisticTransactionDB: If this value is too low, some transactions may fail at commit time due to not being able to determine whether there were any write conflicts.

When using a TransactionDB: If Transaction::SetSnapshot is used, TransactionDB will read either in-memory write buffers or SST files to do write-conflict checking. Increasing this value can reduce the number of reads to SST files done for conflict detection.

Setting this value to 0 will cause write buffers to be freed immediately after they are flushed. If this value is set to -1, 'max_write_buffer_number * write_buffer_size' will be used.

Default: If using a TransactionDB/OptimisticTransactionDB, the default value will be set to the value of 'max_write_buffer_number * write_buffer_size' if it is not explicitly set by the user. Otherwise, the default is 0.

def set_enable_pipelined_write(self, /, value):

By default, a single write thread queue is maintained. The thread that gets to the head of the queue becomes the write batch group leader and is responsible for writing to the WAL and memtable for the batch group.

If enable_pipelined_write is true, separate write thread queues are maintained for WAL writes and memtable writes. A write thread first enters the WAL writer queue and then the memtable writer queue. Pending threads on the WAL writer queue thus only have to wait for previous writers to finish their WAL writing, but not the memtable writing. Enabling the feature may improve write throughput and reduce latency of the prepare phase of two-phase commit.

Default: false

def set_memtable_factory(self, /, factory):

Defines the underlying memtable implementation. See official wiki for more information. Defaults to using a skiplist.

Example:

::

from rocksdict import Options, MemtableFactory
opts = Options()
factory = MemtableFactory.hash_skip_list(bucket_count=1_000_000,
                                         height=4,
                                         branching_factor=4)

opts.set_allow_concurrent_memtable_write(False)
opts.set_memtable_factory(factory)
def set_block_based_table_factory(self, /, factory):
def set_cuckoo_table_factory(self, /, factory):

Sets the table factory to a CuckooTableFactory (the default table factory is a block-based table factory that provides a default implementation of TableBuilder and TableReader with default BlockBasedTableOptions). See official wiki for more information on this table format.

Example:

::

from rocksdict import Options, CuckooTableOptions

opts = Options()
factory_opts = CuckooTableOptions()
factory_opts.set_hash_ratio(0.8)
factory_opts.set_max_search_depth(20)
factory_opts.set_cuckoo_block_size(10)
factory_opts.set_identity_as_first_hash(True)
factory_opts.set_use_module_hash(False)

opts.set_cuckoo_table_factory(factory_opts)
def set_plain_table_factory(self, /, options):

This is a factory that provides TableFactory objects. Default: a block-based table factory that provides a default implementation of TableBuilder and TableReader with default BlockBasedTableOptions. Sets the factory as plain table. See official wiki for more information.

Example:

::

from rocksdict import Options, PlainTableFactoryOptions

opts = Options()
factory_opts = PlainTableFactoryOptions()
factory_opts.user_key_length = 0
factory_opts.bloom_bits_per_key = 20
factory_opts.hash_table_ratio = 0.75
factory_opts.index_sparseness = 16

opts.set_plain_table_factory(factory_opts)
def set_min_level_to_compress(self, /, lvl):

Sets the start level to use compression.

def set_report_bg_io_stats(self, /, enable):

Measure IO stats in compactions and flushes, if true.

Default: false

def set_max_total_wal_size(self, /, size):

Once write-ahead logs exceed this size, we will start forcing the flush of column families whose memtables are backed by the oldest live WAL file (i.e. the ones that are causing all the space amplification).

Default: 0

def set_wal_recovery_mode(self, /, mode):

Recovery mode to control the consistency while replaying WAL.

Default: DBRecoveryMode::PointInTime

def enable_statistics(self, /):
def get_statistics(self, /):
def set_stats_dump_period_sec(self, /, period):

If not zero, dump rocksdb.stats to LOG every stats_dump_period_sec.

Default: 600 (10 mins)

def set_stats_persist_period_sec(self, /, period):

If not zero, dump rocksdb.stats to RocksDB every stats_persist_period_sec.

Default: 600 (10 mins)

def set_advise_random_on_open(self, /, advise):

When set to true, reading SST files will opt out of the filesystem's readahead. Setting this to false may improve sequential iteration performance.

Default: true

def set_use_adaptive_mutex(self, /, enabled):

Enable/disable adaptive mutex, which spins in the user space before resorting to kernel.

This could reduce context switch when the mutex is not heavily contended. However, if the mutex is hot, we could end up wasting spin time.

Default: false

def set_num_levels(self, /, n):

Sets the number of levels for this database.

def set_memtable_prefix_bloom_ratio(self, /, ratio):

When a prefix_extractor is defined through opts.set_prefix_extractor this creates a prefix bloom filter for each memtable with the size of write_buffer_size * memtable_prefix_bloom_ratio (capped at 0.25).

Default: 0

def set_max_compaction_bytes(self, /, nbytes):

Sets the maximum number of bytes in all compacted files. We try to limit number of bytes in one compaction to be lower than this threshold. But it's not guaranteed.

Value 0 will be sanitized.

Default: target_file_size_base * 25

def set_wal_dir(self, /, path):

Specifies the absolute path of the directory the write-ahead log (WAL) should be written to.

Default: same directory as the database

def set_wal_ttl_seconds(self, /, secs):

Sets the WAL ttl in seconds.

The following two options affect how archived logs will be deleted.

  1. If both set to 0, logs will be deleted asap and will not get into the archive.
  2. If wal_ttl_seconds is 0 and wal_size_limit_mb is not 0, WAL files will be checked every 10 min and if total size is greater than wal_size_limit_mb, they will be deleted starting with the earliest until size_limit is met. All empty files will be deleted.
  3. If wal_ttl_seconds is not 0 and wal_size_limit_mb is 0, then WAL files will be checked every wal_ttl_seconds / 2 and those that are older than wal_ttl_seconds will be deleted.
  4. If both are not 0, WAL files will be checked every 10 min and both checks will be performed with ttl being first.

Default: 0

def set_wal_size_limit_mb(self, /, size):

Sets the WAL size limit in MB.

If the total size of WAL files is greater than wal_size_limit_mb, they will be deleted starting with the earliest until size_limit is met.

Default: 0

def set_manifest_preallocation_size(self, /, size):

Sets the number of bytes to preallocate (via fallocate) the manifest files.

Default is 4MB, which is reasonable to reduce random IO as well as prevent overallocation for mounts that preallocate large amounts of data (such as xfs's allocsize option).

def set_skip_stats_update_on_db_open(self, /, skip):

If true, then DB::Open() will not update the statistics used to optimize compaction decision by loading table properties from many files. Turning off this feature will improve DBOpen time especially in disk environment.

Default: false

def set_keep_log_file_num(self, /, nfiles):

Specify the maximal number of info log files to be kept.

Default: 1000

def set_allow_mmap_writes(self, /, is_enabled):

Allow the OS to mmap file for writing.

Default: false

def set_allow_mmap_reads(self, /, is_enabled):

Allow the OS to mmap file for reading sst tables.

Default: false

def set_atomic_flush(self, /, atomic_flush):

Guarantee that all column families are flushed together atomically. This option applies to both manual flushes (db.flush()) and automatic background flushes caused when memtables are filled.

Note that this is only useful when the WAL is disabled. When using the WAL, writes are always consistent across column families.

Default: false

def set_row_cache(self, /, cache):

Sets global cache for table-level rows. Cache must outlive DB instance which uses it.

Default: null (disabled)

Not supported in ROCKSDB_LITE mode!

def set_ratelimiter(self, /, rate_bytes_per_sec, refill_period_us, fairness):

Use to control write rate of flush and compaction. Flush has higher priority than compaction. If rate limiter is enabled, bytes_per_sync is set to 1MB by default.

Default: disable
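
A minimal sketch of enabling the rate limiter (16 MB/s, a 100 ms refill period, and a fairness of 10 are illustrative values, not recommendations):

::

from rocksdict import Options

opts = Options()
opts.set_ratelimiter(16 * 1024 * 1024, 100_000, 10)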

def set_max_log_file_size(self, /, size):

Sets the maximal size of the info log file.

If the log file is larger than max_log_file_size, a new info log file will be created. If max_log_file_size is equal to zero, all logs will be written to one log file.

Default: 0

Example:

::

from rocksdict import Options

options = Options()
options.set_max_log_file_size(0)
def set_log_file_time_to_roll(self, /, secs):

Sets the time for the info log file to roll (in seconds).

If specified with a non-zero value, the log file will be rolled if it has been active longer than log_file_time_to_roll.

Default: 0 (disabled)

def set_recycle_log_file_num(self, /, num):

Controls the recycling of log files.

If non-zero, previously written log files will be reused for new logs, overwriting the old data. The value indicates how many such files we will keep around at any point in time for later use. This is more efficient because the blocks are already allocated and fdatasync does not need to update the inode after each write.

Default: 0

Example:

::

from rocksdict import Options

options = Options()
options.set_recycle_log_file_num(5)
def set_soft_pending_compaction_bytes_limit(self, /, limit):

Sets the threshold at which all writes will be slowed down to at least delayed_write_rate if the estimated bytes needed to be compacted exceed this threshold.

Default: 64GB

def set_hard_pending_compaction_bytes_limit(self, /, limit):

Sets the bytes threshold at which all writes are stopped if the estimated bytes needed to be compacted exceed this threshold.

Default: 256GB
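
For example, both thresholds can be raised together (the figures below simply double the defaults, for illustration only):

::

from rocksdict import Options

opts = Options()
opts.set_soft_pending_compaction_bytes_limit(128 * 1024 ** 3)  # 128 GB
opts.set_hard_pending_compaction_bytes_limit(512 * 1024 ** 3)  # 512 GB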

def set_arena_block_size(self, /, size):

Sets the size of one block in arena memory allocation.

If <= 0, a proper value is automatically calculated (usually 1/10 of write_buffer_size).

Default: 0

def set_dump_malloc_stats(self, /, enabled):

If true, then print malloc stats together with rocksdb.stats when printing to LOG.

Default: false

def set_memtable_whole_key_filtering(self, /, whole_key_filter):

Enable whole key bloom filter in memtable. Note that this will only take effect if memtable_prefix_bloom_size_ratio is not 0. Enabling whole key filtering can potentially reduce CPU usage for point lookups.

Default: false (disable)

Dynamically changeable through SetOptions() API

class ReadOptions:

ReadOptions allows setting iterator bounds and so on.

Arguments:
  • raw_mode (bool): this must be the same as Options raw_mode argument.
ReadOptions()
def fill_cache(self, /, v):

Specify whether the "data block"/"index block"/"filter block" read for this iteration should be cached in memory. Callers may wish to set this field to false for bulk scans.

Default: true

def set_iterate_upper_bound(self, /, key):

Sets the upper bound for an iterator.

def set_iterate_lower_bound(self, /, key):

Sets the lower bound for an iterator.
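
A minimal sketch of bounding an iterator to a key range (this assumes the database uses string keys; the keys shown are illustrative):

::

from rocksdict import ReadOptions

read_opt = ReadOptions()
read_opt.set_iterate_lower_bound("key000")  # inclusive lower bound
read_opt.set_iterate_upper_bound("key100")  # exclusive upper bound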

def set_prefix_same_as_start(self, /, v):

Enforce that the iterator only iterates over the same prefix as the seek. This option is effective only for prefix seeks, i.e. prefix_extractor is non-null for the column family and total_order_seek is false. Unlike iterate_upper_bound, prefix_same_as_start only works within a prefix but in both directions.

Default: false

def set_total_order_seek(self, /, v):

Enable a total order seek regardless of index format (e.g. hash index) used in the table. Some table formats (e.g. plain table) may not support this option.

If true when calling Get(), we also skip prefix bloom when reading from block based table. It provides a way to read existing data after changing implementation of prefix extractor.

def set_max_skippable_internal_keys(self, /, num):

Sets a threshold for the number of keys that can be skipped before failing an iterator seek as incomplete. The default value of 0 should be used to never fail a request as incomplete, even on skipping too many keys.

Default: 0

def set_background_purge_on_iterator_cleanup(self, /, v):

If true, when PurgeObsoleteFile is called in CleanupIteratorState, we schedule a background job in the flush job queue and delete obsolete files in background.

Default: false

def set_ignore_range_deletions(self, /, v):

If true, keys deleted using the DeleteRange() API will be visible to readers until they are naturally deleted during compaction. This improves read performance in DBs with many range deletions.

Default: false

def set_verify_checksums(self, /, v):

If true, all data read from underlying storage will be verified against corresponding checksums.

Default: true

def set_readahead_size(self, /, v):

If non-zero, an iterator will create a new table reader which performs reads of the given size. Using a large size (> 2MB) can improve the performance of forward iteration on spinning disks.

Default: 0

Example:

::

from rocksdict import ReadOptions

opts = ReadOptions()
opts.set_readahead_size(4_194_304)  # 4 MB

def set_tailing(self, /, v):

If true, create a tailing iterator. Note that tailing iterators only support moving in the forward direction. Iterating in reverse or seek_to_last are not supported.

def set_pin_data(self, /, v):

Specifies the value of "pin_data". If true, it keeps the blocks loaded by the iterator pinned in memory as long as the iterator is not deleted. If used when reading from tables created with BlockBasedTableOptions::use_delta_encoding = false, the Iterator's property "rocksdb.iterator.is-key-pinned" is guaranteed to return 1.

Default: false

def set_async_io(self, /, v):

Asynchronously prefetch some data.

Used for sequential reads and internal automatic prefetching.

Default: false

class ColumnFamily:

Column family handle. This can be used in WriteBatch to specify Column Family.

ColumnFamily()
class IngestExternalFileOptions:
IngestExternalFileOptions()
def set_move_files(self, /, v):

Can be set to true to move the files instead of copying them.

def set_snapshot_consistency(self, /, v):

If set to false, an ingested file's keys could appear in existing snapshots that were created before the file was ingested.

def set_allow_global_seqno(self, /, v):

If set to false, IngestExternalFile() will fail if the file key range overlaps with existing keys or tombstones in the DB.

def set_allow_blocking_flush(self, /, v):

If set to false and the file key range overlaps with the memtable key range (memtable flush required), IngestExternalFile will fail.

def set_ingest_behind(self, /, v):

Set to true if you would like duplicate keys in the file being ingested to be skipped rather than overwriting existing data under that key. Use case: back-filling historical data into the database without overwriting existing newer versions of the data. This option can only be used if the DB has been running with allow_ingest_behind=true since the dawn of time. All files will be ingested at the bottommost level with seqno=0.
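
A minimal sketch of configuring ingestion options (passing the object to the actual ingestion call is not shown here):

::

from rocksdict import IngestExternalFileOptions

ingest_opts = IngestExternalFileOptions()
ingest_opts.set_move_files(True)            # move SST files instead of copying them
ingest_opts.set_allow_blocking_flush(True)  # allow a memtable flush when key ranges overlap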

class DBPath:
DBPath()
class MemtableFactory:

Defines the underlying memtable implementation. See official wiki for more information.

MemtableFactory()
def vector():
def hash_skip_list(bucket_count, height, branching_factor):
class BlockBasedOptions:

For configuring block-based file storage.

BlockBasedOptions()
def set_block_size(self, /, size):

Approximate size of user data packed per block. Note that the block size specified here corresponds to uncompressed data. The actual size of the unit read from disk may be smaller if compression is enabled. This parameter can be changed dynamically.

def set_metadata_block_size(self, /, size):

Block size for partitioned metadata. Currently applied to indexes when kTwoLevelIndexSearch is used and to filters when partition_filters is used. Note: Since in the current implementation the filters and index partitions are aligned, an index/filter block is created when either index or filter block size reaches the specified limit.

Note: this limit is currently applied to only index blocks; a filter partition is cut right after an index block is cut.

def set_partition_filters(self, /, size):

Note: currently this option requires kTwoLevelIndexSearch to be set as well.

Use partitioned full filters for each SST file. This option is incompatible with block-based filters.

def set_block_cache(self, /, cache):

Sets global cache for blocks (user data is stored in a set of blocks, and a block is the unit of reading from disk). Cache must outlive DB instance which uses it.

If set, use the specified cache for blocks. By default, rocksdb will automatically create and use an 8MB internal cache.
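
As a sketch, a shared cache can be attached to the block-based table factory (the 512 MB capacity and 4 KB estimated entry charge are illustrative, not tuned values):

::

from rocksdict import Options, BlockBasedOptions, Cache

opts = Options()
block_opts = BlockBasedOptions()
block_opts.set_block_cache(Cache.new_hyper_clock_cache(512 * 1024 * 1024, 4096))
opts.set_block_based_table_factory(block_opts)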

def disable_cache(self, /):

Disable block cache

def set_bloom_filter(self, /, bits_per_key, block_based):

Sets the filter policy to reduce disk read
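
For example (10 bits per key and a full, non-block-based filter are common but illustrative choices):

::

from rocksdict import Options, BlockBasedOptions

opts = Options()
block_opts = BlockBasedOptions()
block_opts.set_bloom_filter(10, False)
opts.set_block_based_table_factory(block_opts)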

def set_cache_index_and_filter_blocks(self, /, v):
def set_index_type(self, /, index_type):

Defines the index type to be used for SS-table lookups.

Example:

::

from rocksdict import BlockBasedOptions, BlockBasedIndexType, Options

opts = Options()
block_opts = BlockBasedOptions()
block_opts.set_index_type(BlockBasedIndexType.hash_search())
opts.set_block_based_table_factory(block_opts)
def set_pin_l0_filter_and_index_blocks_in_cache(self, /, v):

If cache_index_and_filter_blocks is true and this is set to true, then filter and index blocks are stored in the cache, but a reference is held in the "table reader" object so the blocks are pinned and only evicted from cache when the table reader is freed.

Default: false.

def set_pin_top_level_index_and_filter(self, /, v):

If cache_index_and_filter_blocks is true and this is set to true, then the top-level index of partitioned filter and index blocks are stored in the cache, but a reference is held in the "table reader" object so the blocks are pinned and only evicted from cache when the table reader is freed. This is not limited to L0 in the LSM tree.

Default: false.

def set_format_version(self, /, version):

Format version, reserved for backward compatibility.

See full list of the supported versions.

Default: 2.

def set_block_restart_interval(self, /, interval):

Number of keys between restart points for delta encoding of keys. This parameter can be changed dynamically. Most clients should leave this parameter alone. The minimum value allowed is 1. Any smaller value will be silently overwritten with 1.

Default: 16.

def set_index_block_restart_interval(self, /, interval):

Same as block_restart_interval but used for the index block. If you don't plan to run RocksDB before version 5.16 and you are using index_block_restart_interval > 1, you should probably set the format_version to >= 4 as it would reduce the index size.

Default: 1.

def set_data_block_index_type(self, /, index_type):
Set the data block index type for point lookups:

  • DataBlockIndexType::BinarySearch: use binary search within the data block.
  • DataBlockIndexType::BinaryAndHash: use the data block hash index in combination with the normal binary search.

The hash table utilization ratio is adjustable using set_data_block_hash_ratio, which is valid only when using DataBlockIndexType::BinaryAndHash.

Default: BinarySearch

Example:

::

from rocksdict import BlockBasedOptions, DataBlockIndexType, Options

opts = Options()
block_opts = BlockBasedOptions()
block_opts.set_data_block_index_type(DataBlockIndexType.binary_and_hash())
block_opts.set_data_block_hash_ratio(0.85)
opts.set_block_based_table_factory(block_opts)
def set_data_block_hash_ratio(self, /, ratio):

Set the data block hash index utilization ratio.

The smaller the utilization ratio, the fewer hash collisions happen, which reduces the risk that a point lookup falls back to binary search due to collisions. A small ratio means a faster lookup at the price of more space overhead.

Default: 0.75

def set_checksum_type(self, /, checksum_type):

Use the specified checksum type. Newly created table files will be protected with this checksum type. Old table files will still be readable, even though they have different checksum type.
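
Example (picking XXH3 here is only for illustration):

::

from rocksdict import Options, BlockBasedOptions, ChecksumType

opts = Options()
block_opts = BlockBasedOptions()
block_opts.set_checksum_type(ChecksumType.xxh3())
opts.set_block_based_table_factory(block_opts)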

class PlainTableFactoryOptions:

Used with DBOptions::set_plain_table_factory. See official wiki for more information.

Defaults:

  • user_key_length: 0 (variable length)
  • bloom_bits_per_key: 10
  • hash_table_ratio: 0.75
  • index_sparseness: 16

PlainTableFactoryOptions()
class CuckooTableOptions:

Configuration of cuckoo-based storage.

CuckooTableOptions()
def set_hash_ratio(self, /, ratio):

Determines the utilization of hash tables. Smaller values result in larger hash tables with fewer collisions. Default: 0.9

def set_max_search_depth(self, /, depth):

A property used by the builder to determine how deep to search for a path to displace elements in case of collision. See the Builder.MakeSpaceForKey method. Higher values result in more efficient hash tables with fewer lookups, but take more time to build. Default: 100

def set_cuckoo_block_size(self, /, size):

In case of collision while inserting, the builder attempts to insert in the next cuckoo_block_size locations before skipping over to the next Cuckoo hash function. This makes lookups more cache friendly in case of collisions. Default: 5

def set_identity_as_first_hash(self, /, flag):

If this option is enabled, the user key is treated as uint64_t and its value is used as the hash value directly. This option changes the builder's behavior. Readers ignore this option and behave according to what is specified in the table property. Default: false

def set_use_module_hash(self, /, flag):

If this option is set to true, modulo is used during hash calculation. This often yields better space efficiency at the cost of performance. If this option is set to false, the number of entries in the table is constrained to be a power of two, and bitwise AND is used to calculate the hash, which is faster in general. Default: true

class UniversalCompactOptions:
UniversalCompactOptions()
max_size_amplification_percent

Sets the size amplification.

It is defined as the amount (in percentage) of additional storage needed to store a single byte of data in the database. For example, a size amplification of 2% means that a database that contains 100 bytes of user data may occupy up to 102 bytes of physical storage. By this definition, a fully compacted database has a size amplification of 0%. RocksDB uses the following heuristic to calculate size amplification: it assumes that all files excluding the earliest file contribute to the size amplification.

Default: 200, which means that a 100-byte database could require up to 300 bytes of storage.

compression_size_percent

Sets the percentage of compression size.

If this option is set to -1, all the output files will follow the compression type specified.

If this option is not negative, we will try to make sure the compressed size is just above this value. In normal cases, at least this percentage of data will be compressed. When we are compacting to a new file, here is the criterion for whether it needs to be compressed: assuming the files sorted by generation time are A1...An B1...Bm C1...Ct, where A1 is the newest and Ct is the oldest, and we are going to compact B1...Bm, we calculate the total size of all the files as total_size and the total size of C1...Ct as total_C; the compaction output file will be compressed iff total_C / total_size < this percentage.

Default: -1

stop_style

Sets the algorithm used to stop picking files into a single compaction run.

Default: UniversalCompactionStopStyle::Total

size_ratio

Sets the percentage flexibility while comparing file size. If the candidate file(s) size is 1% smaller than the next file's size, then include the next file in this candidate set.

Default: 1

min_merge_width

Sets the minimum number of files in a single compaction run.

Default: 2

max_merge_width

Sets the maximum number of files in a single compaction run.

Default: UINT_MAX
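
A minimal sketch, assuming these fields are assignable properties as their listing above suggests (the values shown are the documented defaults):

::

from rocksdict import UniversalCompactOptions, UniversalCompactionStopStyle

uni = UniversalCompactOptions()
uni.max_size_amplification_percent = 200
uni.size_ratio = 1
uni.min_merge_width = 2
uni.stop_style = UniversalCompactionStopStyle.total()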

class UniversalCompactionStopStyle:
UniversalCompactionStopStyle()
def similar():
def total():
class SliceTransform:
SliceTransform()
def create_fixed_prefix(len):
def create_max_len_prefix(len):

Caps the prefix length at len. If the key is longer than len, the prefix has length len; if the key is shorter than len, the prefix is the key itself.

def create_noop():
class DataBlockIndexType:
DataBlockIndexType()
def binary_and_hash():

Appends a compact hash table to the end of the data block for efficient indexing. Backwards compatible with databases created without this feature. Once turned on, existing data will be gradually converted to the hash index format.

class BlockBasedIndexType:
BlockBasedIndexType()
class Cache:
Cache()
def new_hyper_clock_cache(capacity, estimated_entry_charge):

Creates a HyperClockCache with capacity in bytes.

estimated_entry_charge is an important tuning parameter. The optimal choice at any given time is (cache.get_usage() - 64 * cache.get_table_address_count()) / cache.get_occupancy_count(), or approximately cache.get_usage() / cache.get_occupancy_count().

However, the value cannot be changed dynamically, so as the cache composition changes at runtime, the following tradeoffs apply:

  • If the estimate is substantially too high (e.g., 25% higher), the cache may have to evict entries to prevent load factors that would dramatically affect lookup times.
  • If the estimate is substantially too low (e.g., less than half), then metadata space overhead is substantially higher.

The latter is generally preferable, and picking the larger of block size and metadata block size is a reasonable choice that errs towards this side.
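
A minimal sketch (the 1 GB capacity and the 8 KB estimated entry charge, roughly a block size, are assumptions for illustration):

::

from rocksdict import Cache

cache = Cache.new_hyper_clock_cache(1024 * 1024 * 1024, 8 * 1024)
print(cache.get_usage())  # current memory usage of the cache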

def get_usage(self, /):

Returns the Cache memory usage

def get_pinned_usage(self, /):

Returns pinned memory usage

def set_capacity(self, /, capacity):

Sets cache capacity

class ChecksumType:

Used by BlockBasedOptions::set_checksum_type.

Call the corresponding functions of each to get one of the following.

  • NoChecksum
  • CRC32c
  • XXHash
  • XXHash64
  • XXH3
ChecksumType()
def no_checksum():
def crc32c():
def xxhash():
def xxhash64():
def xxh3():
class DBCompactionStyle:

This is to be treated as an enum.

Call the corresponding functions of each to get one of the following.

  • Level
  • Universal
  • Fifo

Below is an example to set compaction style to Fifo.

Example:

::

opt = Options()
opt.set_compaction_style(DBCompactionStyle.fifo())
DBCompactionStyle()
def level():
def universal():
def fifo():
class DBCompressionType:

This is to be treated as an enum.

Call the corresponding functions of each to get one of the following.

  • None
  • Snappy
  • Zlib
  • Bz2
  • Lz4
  • Lz4hc
  • Zstd

Below is an example to set compression type to Snappy.

Example:

::

opt = Options()
opt.set_compression_type(DBCompressionType.snappy())
DBCompressionType()
def none():
def snappy():
def zlib():
def bz2():
def lz4():
def lz4hc():
def zstd():
class DBRecoveryMode:

This is to be treated as an enum.

Call the corresponding functions of each to get one of the following.

  • TolerateCorruptedTailRecords
  • AbsoluteConsistency
  • PointInTime
  • SkipAnyCorruptedRecord

Below is an example to set recovery mode to PointInTime.

Example:

::

opt = Options()
opt.set_wal_recovery_mode(DBRecoveryMode.point_in_time())
DBRecoveryMode()
def tolerate_corrupted_tail_records():
def absolute_consistency():
def point_in_time():
def skip_any_corrupted_record():
class Env:
Env()
def mem_env():

Returns a new environment that stores its data in memory and delegates all non-file-storage tasks to base_env.

def set_background_threads(self, /, num_threads):

Sets the number of background worker threads of a specific thread pool for this environment. LOW is the default pool.

Default: 1

def set_high_priority_background_threads(self, /, n):

Sets the size of the high priority thread pool that can be used to prevent compactions from stalling memtable flushes.

def set_low_priority_background_threads(self, /, n):

Sets the size of the low priority thread pool that can be used to prevent compactions from stalling memtable flushes.

def set_bottom_priority_background_threads(self, /, n):

Sets the size of the bottom priority thread pool that can be used to prevent compactions from stalling memtable flushes.
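
A minimal sketch of sizing the thread pools (the counts are illustrative; attaching the Env to an Options object is not shown here):

::

from rocksdict import Env

env = Env()
env.set_background_threads(4)                # LOW pool
env.set_high_priority_background_threads(2)  # HIGH pool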

def join_all_threads(self, /):

Wait for all threads started by StartThread to terminate.

def lower_thread_pool_io_priority(self, /):

Lowering IO priority for threads from the specified pool.

def lower_high_priority_thread_pool_io_priority(self, /):

Lowering IO priority for high priority thread pool.

def lower_thread_pool_cpu_priority(self, /):

Lowering CPU priority for threads from the specified pool.

def lower_high_priority_thread_pool_cpu_priority(self, /):

Lowering CPU priority for high priority thread pool.

class FifoCompactOptions:
FifoCompactOptions()
max_table_files_size

Sets the max table file size.

Once the total sum of table files reaches this, we will delete the oldest table file.

Default: 1GB

class CompactOptions:
CompactOptions()
def set_exclusive_manual_compaction(self, /, v):

If more than one thread calls manual compaction, only one will actually schedule it while the other threads will simply wait for the scheduled manual compaction to complete. If exclusive_manual_compaction is set to true, the call will disable scheduling of automatic compaction jobs and wait for existing automatic compaction jobs to finish.

def set_bottommost_level_compaction(self, /, lvl):

Sets bottommost level compaction.

def set_change_level(self, /, v):

If true, compacted files will be moved to the minimum level capable of holding the data, or to the given level (specified by a non-negative target_level).

def set_target_level(self, /, lvl):

If change_level is true and target_level has a non-negative value, compacted files will be moved to target_level.
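
A minimal sketch of preparing options for a manual compaction (passing them to the compaction call itself is not shown here):

::

from rocksdict import CompactOptions, BottommostLevelCompaction

copts = CompactOptions()
copts.set_exclusive_manual_compaction(True)
copts.set_bottommost_level_compaction(BottommostLevelCompaction.force_optimized())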

class BottommostLevelCompaction:
BottommostLevelCompaction()
def skip():

Skip bottommost level compaction

def if_have_compaction_filter():

Only compact the bottommost level if there is a compaction filter. This is the default option.

def force():

Always compact bottommost level

def force_optimized():

Always compact the bottommost level, but in the bottommost level avoid double-compacting files created in the same compaction.

class KeyEncodingType:
KeyEncodingType()
def plain():

Always write full keys.

def prefix():

Find opportunities to write the same prefix for multiple rows.