Designing an Oracle 26ai Vector Store with Flexible Metadata Filters
1. The problem you are actually solving
If you are building a product similar to OpenAI Vector Store, the product requirement often sounds simple: let each user attach arbitrary metadata to documents or files, then support metadata filtering together with semantic vector search.
At the API level, this usually looks like a small map of key-value pairs. In OpenAI’s public interface, vector store file objects support attributes, and those attributes can later be used in queries. The public contract is intentionally simple. It allows a limited number of key-value pairs, and the values are restricted to simple scalar types such as strings, numbers, and booleans.
That product contract does not imply that the database must create a separate index for every user-defined key. In Oracle 26ai, the right storage and indexing design is different. Oracle gives you JSON storage, JSON search indexes, function-based indexes on json_value, vector indexes such as HNSW and IVF, hybrid vector indexes, optimizer hints, and plan inspection tools.
The correct design principle is straightforward. Store flexible metadata in JSON. Use one general JSON index to support unknown and long-tail fields. Promote only a small number of hot fields into more targeted structures such as function-based indexes or virtual columns.
2. The key distinction: one broad JSON index versus one narrow path index
A common misunderstanding is to confuse two different kinds of indexing.
The first is indexing the JSON column as a whole. In Oracle, a JSON search index is a general-purpose index over the JSON document. It is meant for ad hoc structural queries and full-text search. By default, it indexes all leaf fields in the document, including both textual values and numeric ranges. This means you do not need to know every field name in advance.
The second is indexing a specific JSON path. Oracle supports B-tree function-based indexes built on expressions such as:
json_value(attrs, '$.type' RETURNING VARCHAR2(50) ERROR ON ERROR NULL ON EMPTY)
This kind of index is for fields you already know matter, such as type, age, release_year, or organization.
These two approaches are complementary. A JSON search index is the broad baseline for flexible metadata. A function-based index is the precise optimization for a known query pattern.
3. Why arbitrary user metadata is still indexable
Consider three different documents with three different metadata shapes:
{"type":"teacher","age":30}
{"city":"DC","score":100}
{"org":"SLQ","release_year":2022}
This is still indexable. Flexible JSON does not force you into either full scans or one custom index per user-defined key. Oracle’s JSON search index is designed for exactly this situation. It allows one general index over a schema-flexible JSON column so that different rows can carry different keys.
That is the practical middle ground. You do not need to predefine every metadata field. You also do not need to give up indexing. You start with one general JSON search index over the metadata column, then add more specialized indexes only for fields that become important in real workloads.
4. Use a two-layer schema when metadata is document-level
If many chunks from the same document share the same metadata, duplicating the same JSON on every chunk row is usually the wrong design. A cleaner structure is to separate document-level metadata from chunk-level vectors.
A practical schema looks like this:
CREATE TABLE docs (
user_id NUMBER NOT NULL,
doc_id NUMBER NOT NULL,
attrs JSON NOT NULL,
PRIMARY KEY (user_id, doc_id)
);
CREATE TABLE chunks (
user_id NUMBER NOT NULL,
doc_id NUMBER NOT NULL,
chunk_id NUMBER NOT NULL,
chunk_text CLOB NOT NULL,
embedding VECTOR(...) NOT NULL,
PRIMARY KEY (user_id, doc_id, chunk_id),
FOREIGN KEY (user_id, doc_id) REFERENCES docs(user_id, doc_id)
);
This design matches the logical model of most retrieval systems. Metadata belongs to the document or file. Embeddings belong to the chunks. Queries can first constrain candidate documents through docs.attrs, then search only the chunks that belong to those documents.
This also matches the product model more naturally. In an OpenAI-style interface, attributes are attached to higher-level content objects such as files, not duplicated on every chunk.
5. The baseline indexes you should create first
The first indexes should be ordinary relational indexes. In a multi-tenant system, tenant scoping is the first and most important way to reduce the search space.
At minimum, you typically want:
-
an index on
user_idortenant_id -
an index on
(user_id, doc_id)where appropriate
This step is not optional. Vector and JSON features are not a substitute for basic relational selectivity.
Next, create one JSON search index on the document metadata column:
CREATE SEARCH INDEX idx_docs_attrs
ON docs(attrs)
FOR JSON;
This gives you a general index over flexible metadata.
Then create a vector index on the chunk embedding column. Depending on your workload and Oracle configuration, this may be HNSW or IVF.
The default architecture is therefore simple:
-
Relational indexes for tenant and join scope
-
One JSON search index for flexible metadata
-
One vector index for semantic retrieval
That is the right baseline before you start optimizing hot fields.
6. Can the JSON search index help with simple metadata filters?
Yes, often it can.
If you create a JSON search index on docs.attrs, then a filter such as:
json_value(attrs, '$.type' RETURNING VARCHAR2(50)) = 'teacher'
can often benefit from that general index even if you did not create a dedicated index on $.type.
The same applies to numeric filters such as:
json_value(attrs, '$.age' RETURNING NUMBER) = 20
If the JSON search index exists, that field is no longer unindexed in the practical sense. It is covered by the general JSON index. That is why a single JSON search index is such a strong baseline for unknown metadata schemas.
Still, you should be careful about wording. The right claim is not that Oracle will always use the JSON search index for every such predicate. The right claim is that these predicates are eligible to benefit from it, and the actual plan should be verified.
7. When a dedicated function-based index is better
If a field becomes a frequent filter, a targeted function-based index is usually the better choice.
For example:
CREATE INDEX idx_docs_type
ON docs (
json_value(
attrs,
'$.type'
RETURNING VARCHAR2(50)
ERROR ON ERROR
NULL ON EMPTY
)
);
This is the right move when type becomes a stable and important access path.
The same idea applies to numeric fields:
CREATE INDEX idx_docs_age
ON docs (
json_value(
attrs,
'$.age'
RETURNING NUMBER
ERROR ON ERROR
NULL ON EMPTY
)
);
If two fields are frequently queried together, a composite function-based index may be appropriate:
CREATE INDEX idx_docs_type_age
ON docs (
json_value(
attrs,
'$.type'
RETURNING VARCHAR2(50)
ERROR ON ERROR
NULL ON EMPTY
),
json_value(
attrs,
'$.age'
RETURNING NUMBER
ERROR ON ERROR
NULL ON EMPTY
)
);
This is not your starting point. It is an optimization step for established query patterns.
8. Single-field and multi-field filters follow the same logic
A JSON search index is broad enough to support both single-field and multi-field metadata predicates because it indexes all leaf fields by default unless you deliberately use path subsetting.
So these two kinds of predicates may both benefit from the same general JSON search index:
-
type = 'teacher' -
type = 'teacher' AND age = 30
That is exactly why one JSON search index works well as the default answer to flexible, unpredictable metadata filtering.
However, the moment a particular combination becomes frequent and performance-sensitive, that is your signal to consider a more specialized composite index.
9. The logical workflow and the physical execution plan are not the same
At the application level, people often think in this order:
-
filter by metadata
-
run vector search on the reduced subset
That is a valid business-level mental model. It is not a guarantee about physical execution.
In Oracle, the optimizer may choose different strategies depending on the vector index type and query shape.
For HNSW, Oracle documents alternatives such as pre-filtering and in-filtering. In pre-filtering, metadata filters are evaluated first, and vector traversal is restricted to rows that pass the filter. In in-filtering, Oracle traverses the vector index first and applies filters to the vector candidates.
For IVF, Oracle documents pre-filtering and post-filtering. In pre-filtering, filters are applied before distance computation over the candidates. In post-filtering, distance-based candidate selection comes first and the metadata filter is applied afterward.
So the correct statement is this: your application can think in terms of metadata first and vector second, but Oracle may physically execute the query in different orders.
10. SQL text order does not control execution order
You cannot force Oracle to filter before vector search simply by writing the WHERE clause before the ORDER BY VECTOR_DISTANCE(...) clause. SQL is declarative. The optimizer determines the physical plan.
Oracle provides supported ways to influence this behavior.
One option is the VECTOR_INDEX_TRANSFORM hint, which allows you to specify filter behavior such as pre-filter, in-filter, or post-filter variants depending on the index type.
Another option is DBMS_HYBRID_VECTOR.SEARCH, which accepts a JSON query specification and supports parameters such as filter_type and filter_by.
Even then, hints and API parameters should be treated as directives to validate, not assumptions to trust blindly. The final authority is the execution plan.
11. Explicitly Guiding Oracle to Filter Metadata Before Vector Search
While Oracle’s optimizer is sophisticated, there are scenarios—particularly with high-selectivity filters—where you want to explicitly ensure that the metadata filter is applied before the vector distance calculation. This is known as Pre-filtering.
In Oracle 26ai, you can achieve this through two primary methods: the VECTOR_INDEX_TRANSFORM optimizer hint and the DBMS_HYBRID_VECTOR API.
Using the Optimizer Hint
The most direct way to influence the execution plan in a standard SQL statement is to use the VECTOR_INDEX_TRANSFORM hint. This tells the optimizer to prioritize the filter logic before traversing the vector index.
SELECT /*+ VECTOR_INDEX_TRANSFORM(c.embedding, PRE_FILTER) */
c.doc_id, c.chunk_id, c.chunk_text
FROM docs d
JOIN chunks c
ON c.user_id = d.user_id
AND c.doc_id = d.doc_id
WHERE d.user_id = 1
-- The JSON metadata filter
AND json_value(d.attrs, '$.status' RETURNING VARCHAR2(20)) = 'published'
ORDER BY VECTOR_DISTANCE(c.embedding, :query_vector, COSINE)
FETCH APPROX FIRST 10 ROWS ONLY;
By specifying PRE_FILTER, you instruct Oracle to resolve the json_value predicate (potentially using the JSON search index) to narrow down the candidate doc_id list before the vector index is even touched.
Using the Hybrid Search API
If you are using the programmatic interface, the DBMS_HYBRID_VECTOR.SEARCH function allows you to define the filtering strategy as a parameter within the search specification.
{
"vector": {
"column": "embedding",
"value": [0.12, -0.05, ...],
"top_k": 10
},
"filter": {
"json_path": "$.status",
"operator": "=",
"value": "published",
"filter_type": "PRE_FILTER"
}
}
Setting "filter_type": "PRE_FILTER" in the JSON payload acts as a requirement for the engine, ensuring the metadata constraint is applied at the start of the execution pipeline.
When to Use "Pre-Filter" Hints
Explicitly forcing a pre-filter is highly effective when:
High Selectivity: Your metadata filter reduces the search space by a significant margin (e.g., filtering for a specific
user_idin a multi-tenant system).Performance Stability: You want to prevent the optimizer from occasionally choosing an "In-filtering" path that might be slower for specific data distributions.
12. How to filter on one JSON attribute without involving the others
Suppose a document-level metadata object looks like this:
{"type":"teacher","age":30,"city":"DC","score":100}
If you want to filter only by age, then the query should only reference $.age. The other attributes do not participate.
In the two-layer schema described earlier, the query would look like this:
SELECT c.doc_id, c.chunk_id, c.chunk_text
FROM docs d
JOIN chunks c
ON c.user_id = d.user_id
AND c.doc_id = d.doc_id
WHERE d.user_id = 1
AND json_value(
d.attrs,
'$.age'
RETURNING NUMBER
NULL ON EMPTY
) = 20
ORDER BY VECTOR_DISTANCE(c.embedding, :query_vector, COSINE)
FETCH APPROX FIRST 10 ROWS ONLY WITH TARGET ACCURACY 90;
This query filters on age only. It does not involve type, city, or score.
If you use DBMS_HYBRID_VECTOR.SEARCH, the same principle applies through path-based filtering. The filter only touches the path you specify.
13. How to know what Oracle actually did
You inspect the execution plan.
This is essential for both vector search and JSON filtering. You should verify at least two things:
-
Which filtering strategy Oracle chose for the vector portion
-
Whether Oracle actually used your JSON search index or function-based index
For SQL queries, inspect the plan with the usual Oracle tooling such as EXPLAIN PLAN and DBMS_XPLAN.DISPLAY.
For the hybrid JSON API, use DBMS_HYBRID_VECTOR.GET_SQL to retrieve the generated SQL, then inspect the plan for that SQL.
This is the practical discipline that keeps the design honest. Do not assume that an index was used just because it exists. Confirm it.
14. Hot field promotion is a system design decision
Your users may send any legal metadata key. That does not mean the system should create a dedicated index for each new field.
A better operating model is:
-
Store flexible metadata in JSON
-
Create one JSON search index as the baseline
-
Observe real query patterns
-
Promote only hot fields into specialized structures
In Oracle, Data Guide features support this pattern well. Scalar JSON fields can be projected as virtual columns. Those virtual columns can then be indexed and can also receive optimizer statistics.
This is what hot field promotion really means. It is not user-driven automatic index creation. It is workload-driven optimization.
15. Data typing matters more than many teams expect
If one document stores age as "20" and another stores it as 20, you create unnecessary complexity.
Oracle’s json_value indexing rules are type-sensitive. If you build a function-based index with one return type but query with another, the optimizer may not use the index. In addition, inconsistent scalar types can complicate virtual-column projection and downstream optimization.
A production system should therefore normalize metadata types early:
-
store numeric values as numbers
-
store booleans as booleans
-
store dates in a consistent, query-friendly format
Type discipline makes JSON search indexes, function-based indexes, and future virtual-column promotion much more reliable.
16. Path subsetting is useful, but it changes what is indexed
Oracle allows path subsetting for JSON search indexes. This means you can explicitly include or exclude certain JSON paths from the general index.
This is valuable when metadata becomes very wide, noisy, or operationally expensive to index in full. A narrower JSON search index may reduce index size and maintenance overhead.
But path subsetting changes semantics. If a path is excluded, predicates on that path will no longer benefit from the JSON search index.
So path subsetting is not a default recommendation. It is a later optimization for large and messy metadata payloads, and it should be based on observed workload, not guesswork.
17. Synchronization policy is a product decision as much as a database decision
By default, a JSON search index is maintained asynchronously and synchronized automatically in the background. Oracle also supports SYNC (ON COMMIT) when immediate visibility matters more than write throughput.
This is not just a technical tuning parameter. It affects product behavior.
If users expect newly uploaded metadata to be immediately filterable, then on-commit synchronization may matter. If ingestion throughput matters more and some lag is acceptable, background synchronization may be the better tradeoff.
A vector store is part database design and part product design. Index freshness sits exactly at that boundary.
18. Included columns in HNSW are useful for a small fixed set of fields
Oracle’s HNSW indexes support included columns. This can be useful when you have a small set of known, high-value attributes that you want available directly from the index structure.
That can reduce joins or base-table access in some cases.
However, this is not a solution for arbitrary user metadata. The number of included columns is limited, and the feature is clearly designed for a small fixed set of important attributes.
So included columns should be treated as a late-stage optimization, not as the main answer to flexible metadata.
19. A practical default architecture
A strong default architecture for a multi-tenant Oracle vector store is the following:
-
keep tenant and document identity in normal relational columns
-
store document-level flexible metadata in a JSON column on the document table
-
store chunk text and embeddings in a separate chunk table
-
create relational indexes for tenant and join scope
-
create one JSON search index on document metadata
-
create a vector index on the embedding column
-
use workload analysis to decide which metadata fields deserve dedicated indexes or virtual columns
This architecture is consistent with Oracle’s JSON indexing model, Oracle’s vector-search model, and the product idea of attaching structured attributes to higher-level content objects.
20. Related topics that matter, but belong in their own discussion
Several adjacent topics are also important, but each deserves its own treatment.
First, chunking strategy matters. Retrieval quality depends not only on indexing, but also on how documents are split. Metadata that is truly document-level should usually remain at the document level.
Second, vector accuracy versus cost matters. Approximate vector search settings affect latency, quality, and hardware usage.
Third, hybrid retrieval matters. If user queries contain exact product names, codes, or error identifiers, a hybrid approach that combines keyword and vector retrieval may be more robust than pure semantic search.
Fourth, optimizer statistics matter. Virtual columns are useful not only because they can be indexed, but also because they can carry statistics that improve plan quality.
Finally, API design and storage design should not be confused. A simple public API does not require a simplistic internal schema.
21. Conclusion
The cleanest way to think about this system is as follows.
Flexible user metadata belongs in JSON. Unknown and long-tail fields should first be handled by a general JSON search index. Known and high-frequency fields should later be promoted into function-based indexes or virtual columns. Document-level metadata should usually remain at the document level rather than being duplicated on every chunk row. Semantic retrieval belongs to the vector index. Oracle may execute filtering before or after vector traversal depending on the optimizer and index type, unless you explicitly guide it through supported hints or API parameters. And the only reliable way to verify the actual behavior is to inspect the execution plan and confirm which indexes Oracle used.
That is the design that balances flexibility, performance, and operational sanity in a production Oracle 26ai vector store.
No comments:
Post a Comment