Fix missing move of query_metadata_cache in BlockIO::operator= #96995

Open

alexey-milovidov wants to merge 5 commits into master from fix-blockio-move-query-metadata-cache

Conversation

@alexey-milovidov
Member

Summary

BlockIO::operator=(BlockIO&&) was not moving query_metadata_cache, causing the cache to be destroyed prematurely on every query through both TCP and HTTP paths.

Closes #95742

Root Cause Analysis

Bug #1: BlockIO::operator= does NOT move query_metadata_cache

In src/QueryPipeline/BlockIO.cpp:26-44:

BlockIO & BlockIO::operator= (BlockIO && rhs) noexcept
{
    reset();
    process_list_entries    = std::move(rhs.process_list_entries);
    pipeline                = std::move(rhs.pipeline);
    finalize_query_pipeline = std::move(rhs.finalize_query_pipeline);
    finish_callbacks        = std::move(rhs.finish_callbacks);
    exception_callbacks     = std::move(rhs.exception_callbacks);
    null_format             = rhs.null_format;
    // query_metadata_cache is NOT moved!
    return *this;
}

This is triggered on every query via both execution paths:

  • TCP path (executeQuery.cpp:1970): res = executeQueryImpl(...)
  • HTTP path (executeQuery.cpp:2168): streams = executeQueryImpl(...)

The consequence: executeQueryImpl packs the cache into its returned BlockIO, but operator= strips it out. The temp BlockIO is destroyed immediately, destroying the cache while the pipeline lives on in res.

What happens when the cache is destroyed prematurely

For mutation validation queries (ALTER TABLE ... UPDATE):

  1. MutationsInterpreter::validate() builds a validation pipeline internally, which caches a StorageSnapshotPtr via getStorageSnapshot in the QueryMetadataCache
  2. validate() returns — its internal pipeline is destroyed, releasing its StorageSnapshotPtr
  3. The cache entry is now the ONLY remaining StorageSnapshotPtr (refcount = 1)
  4. InterpreterAlterQuery::execute() returns an empty BlockIO (no pipeline)
  5. executeQueryImpl packs the cache into the returned BlockIO
  6. res = executeQueryImpl(...) invokes operator=, which moves the pipeline (empty) but NOT the cache
  7. The temp BlockIO is destroyed → cache destroyed → last StorageSnapshotPtr released
  8. ~StorageSnapshot → ~SnapshotData:
    • parts released → parts freed → clearCaches() → accesses storage

Why SnapshotData::storage doesn't always save us

Within SnapshotData's destruction, members are destroyed in reverse declaration order, so parts (declared at line 623) is freed before storage (line 619). So normally the storage IS alive when clearCaches runs.

But there's a complicating factor: MergeTreeData::shared_ranges_in_parts (line 1496) may hold the same RangesInDataPartsPtr. When both SnapshotData::parts and shared_ranges_in_parts share the same shared_ptr, the destruction chain becomes:

  1. SnapshotData::parts released → refcount 2→1 (not freed, shared_ranges_in_parts still holds it)
  2. SnapshotData::storage released → if table was DETACHED, this is the last ref → MergeTreeData destructor runs
  3. During ~MergeTreeData(), shared_ranges_in_parts (line 1496) is destroyed → parts freed → clearCaches() → accesses this (the MergeTreeData being destroyed)

At this point we're inside MergeTreeData's destructor. The StorageMergeTree/StorageReplicatedMergeTree destructor has already run (derived class destructors run first). If shutdown() invalidated any state that clearCaches depends on (like context caches, virtual table pointers after derived destructor), this could SEGFAULT.

Bug #2: InterpreterOptimizeQuery mutably modifies cached snapshot

In src/Interpreters/InterpreterOptimizeQuery.cpp:81-82:

if (auto * snapshot_data = dynamic_cast<MergeTreeData::SnapshotData *>(storage_snapshot->data.get()))
    snapshot_data->parts = {};

This clears parts on a cached StorageSnapshot via a mutable cast, while storage remains set. This is a data race on shared state and could cause parts to be freed unexpectedly.

Historical context

Azat Khuzhin attempted a different fix in Jan 2026:

  • 1002f7ce907 — Removed SnapshotData::storage, added BlockIO::resetPipeline() to destroy cache before pipeline
  • d02700909db — Reordered BlockIO fields to match
  • 8c44f5e9374 — Reverted both, putting SnapshotData::storage back

The revert restored the status quo, but the BlockIO::operator= bug (which predates all this) was never fixed.

Changelog category:

  • Critical Bug Fix (crash, data loss, RBAC) or LOGICAL_ERROR

Changelog entry:

Fix crash (SEGFAULT) in clearCaches caused by BlockIO::operator= not moving query_metadata_cache, leading to premature destruction of cached storage snapshots and use-after-free of MergeTreeData storage.

🤖 Generated with Claude Code

Closes #95742

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@clickhouse-gh
Contributor

clickhouse-gh bot commented Feb 15, 2026

Workflow [PR], commit [5e70745]

@clickhouse-gh clickhouse-gh bot added the pr-critical-bugfix and pr-must-backport labels Feb 15, 2026
…cache

Concurrent mutations, selects, and detach/attach operations exercise the
race where the cache is destroyed prematurely while the storage is being
removed from the database.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
# DETACH removes the storage from the database, dropping its StoragePtr.
# If the cache snapshot holds the last ref, destroying the cache will
# free the storage while parts still exist in shared_ranges_in_parts.
$CLICKHOUSE_CLIENT --query "DETACH TABLE ${TABLE}" 2>/dev/null
Member Author

Another proof for #96130

@alexey-milovidov
Member Author

The bug was introduced here: #92118, CC @amosbird
It hit only due to the existence of this optimization: #85535, CC @jawm

@alexey-milovidov
Member Author

Related PRs and Issues

Previous fix attempts (all reverted)

| PR | Author | Title | State |
| --- | --- | --- | --- |
| #95074 | alexey-milovidov | Fix segfault in clearCaches when table is dropped during query | Merged, then reverted |
| #95120 | alexey-milovidov | Revert "Fix segfault in clearCaches when table is dropped during query" | Merged |
| #95393 | azat | Fix order of destruction in MergeTreeReadPoolBase | Merged (survived) |
| #95396 | azat | Do not hold a storage in storage snapshot | Merged, then reverted |
| #95594 | azat | Revert "Do not hold a storage in storage snapshot" | Merged |

Origin of the bug

| PR | Author | Title |
| --- | --- | --- |
| #92118 | Amos Bird | Proper ownership of query metadata cache (introduced query_metadata_cache in BlockIO but operator= was not updated) |
| #85535 | James Morrison | Optimize selection of part list during query planning (introduced shared_ranges_in_parts / shared_parts_list — a contributing factor) |

Issues

| Issue | Title | State |
| --- | --- | --- |
| #95742 | Flaky test: 01076_parallel_alter_replicated_zookeeper | Open |
| #94661 | [CI crash] Incorrect destruction of MergeTreeDataPartCompact caches | Closed |
| #91989 | SIGSEGV in IMergeTreeDataPart::clearCaches during executeReplaceRange | Closed |

…opTablesParallel` path

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@alexey-milovidov
Member Author

MSan stress test report — same root cause confirmed

Report URL: https://s3.amazonaws.com/clickhouse-test-reports/json.html?REF=master&sha=e4d1f04e6bd203d4ccb73f35eac78620ea7cfabb&name_0=MasterCI&name_1=Stress%20test%20%28azure%2C%20amd_msan%29&name_1=Stress%20test%20%28azure%2C%20amd_msan%29

Analysis

The MSan trace from the Stress test (azure, amd_msan) job shows the same root cause as the original SIGSEGV in clearCaches, but manifests differently due to MSan catching the use-after-free earlier.

Destruction path shown in the MSan trace:

  1. DatabaseCatalog::dropTablesParallel frees the storage on a background thread
  2. Meanwhile, on the TCPHandler thread, the pipeline is being destroyed via BlockIO::onException → pipeline.reset()
  3. The pool's parts still reference the freed storage memory via the bare IMergeTreeDataPart::storage reference (const MergeTreeData &)
  4. When clearCaches is called during part destruction, it accesses storage.getPrimaryIndexCache() — hitting freed memory

Why our fix addresses this:

The BlockIO::operator=(BlockIO&&) bug causes the query_metadata_cache (which holds StorageSnapshotPtr entries keeping the storage alive) to be destroyed prematurely when the temporary BlockIO from executeQueryImpl is assigned to the caller's BlockIO variable. Without this cache, the SnapshotData::storage (ConstStoragePtr) — the last strong reference keeping the storage alive — gets freed. This allows DatabaseCatalog::dropTablesParallel to finalize storage destruction on its background thread while the TCPHandler thread's pipeline still holds parts referencing the storage.

By moving query_metadata_cache in BlockIO::operator=, the cache survives for the lifetime of the pipeline, ensuring SnapshotData::storage keeps the storage alive until all parts are properly destroyed.

alexey-milovidov and others added 2 commits February 15, 2026 17:14
The `select_thread` function outputs `count()` results to stdout, polluting
the test output with `0` values that don't match the reference file.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>