Document Ranking
Instructions
Rank and score documents based on similarity metrics already computed by the Retrieval Agent. This skill operates on retrieved documents with distance/similarity information - it does NOT query ChromaDB again.
Default workflow:
- Receive documents from Retrieval Agent (includes distance and similarity_score)
- Call ranking functions to sort documents by relevance
- Optionally filter by similarity threshold
- Return ranked list for downstream processing (grading or generation)
Key functions:
# Rank documents by similarity score (descending)
ranked_docs = rank_documents_by_similarity(documents)
# Rank documents by distance (ascending - lower is better)
ranked_docs = rank_documents_by_distance(documents)
# Filter documents by similarity threshold
filtered_docs = filter_by_similarity_threshold(documents, threshold=0.7)
# Get top-k ranked documents
top_docs = get_top_k_documents(ranked_docs, k=10)
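The source names these four functions but does not show their bodies. The following is a minimal sketch of how they might look, assuming each document is a plain dict carrying `similarity_score` and `distance` keys, and assuming the threshold comparison is strict (`>`), as the "similarity > 0.7" wording in the examples suggests. The real `components/ranker.py` may differ.

```python
from typing import Any

# Assumed shape: one dict per retrieved document.
Document = dict[str, Any]


def rank_documents_by_similarity(documents: list[Document]) -> list[Document]:
    """Return a new list sorted by similarity_score, highest first."""
    return sorted(documents, key=lambda d: d["similarity_score"], reverse=True)


def rank_documents_by_distance(documents: list[Document]) -> list[Document]:
    """Return a new list sorted by distance, lowest first."""
    return sorted(documents, key=lambda d: d["distance"])


def filter_by_similarity_threshold(
    documents: list[Document], threshold: float = 0.7
) -> list[Document]:
    """Keep documents scoring strictly above the threshold (assumed strict)."""
    return [d for d in documents if d["similarity_score"] > threshold]


def get_top_k_documents(ranked_docs: list[Document], k: int = 10) -> list[Document]:
    """Return the first k documents of an already-ranked list."""
    return ranked_docs[:k]
```

None of these functions touches ChromaDB: they only sort or slice the list they are given, and `sorted` leaves the input list unmodified.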
Similarity Metrics:
Each document from Retrieval Agent contains:
- distance: Cosine distance from ChromaDB (lower = more similar)
  - Range: [0, 2] for cosine distance
  - 0 = identical, 2 = opposite
- similarity_score: Computed as 1 - distance (higher = more similar)
  - Range: [-1, 1] for cosine
  - 1 = identical, -1 = opposite
  - Typical useful range: [0.5, 1.0]
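To make the relationship between the two fields concrete, here is a small sketch of the conversion with a range check. The helper name `distance_to_similarity` is hypothetical and not part of the documented API; rejecting out-of-range values is likewise an assumption about how bad input should be handled.

```python
def distance_to_similarity(distance: float) -> float:
    """Convert a cosine distance in [0, 2] to a similarity score in [-1, 1]."""
    # Hypothetical helper: raise on values outside the cosine-distance range.
    if not 0.0 <= distance <= 2.0:
        raise ValueError(f"cosine distance out of range [0, 2]: {distance}")
    return 1.0 - distance
```

Because the mapping is strictly decreasing, sorting ascending by `distance` and sorting descending by `similarity_score` produce the same order.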
Ranking Strategies:
- By Similarity Score (Recommended): Sort descending by similarity_score
  - Higher scores = more relevant
  - Intuitive: 0.9 > 0.7 > 0.5
- By Distance: Sort ascending by distance
  - Lower distances = more relevant
  - Direct from vector search
- Hybrid with Collection Priority: Rank by score within each collection, then merge
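The hybrid strategy could be sketched as follows. `rank_by_collection` is named in the source, but its body here is an assumption, as is the `collection` key on each document. `merge_by_priority` and its `priority` parameter are hypothetical names introduced only for illustration.

```python
from collections import defaultdict
from typing import Any

Document = dict[str, Any]


def rank_by_collection(documents: list[Document]) -> dict[str, list[Document]]:
    """Group documents by collection, then sort each group by score (descending)."""
    grouped: dict[str, list[Document]] = defaultdict(list)
    for doc in documents:
        grouped[doc["collection"]].append(doc)
    return {
        name: sorted(docs, key=lambda d: d["similarity_score"], reverse=True)
        for name, docs in grouped.items()
    }


def merge_by_priority(
    ranked: dict[str, list[Document]], priority: list[str]
) -> list[Document]:
    """Hypothetical merge step: emit collections in priority order."""
    merged: list[Document] = []
    for name in priority:
        merged.extend(ranked.get(name, []))
    # Collections not listed in `priority` are appended last.
    for name, docs in ranked.items():
        if name not in priority:
            merged.extend(docs)
    return merged
```

With `priority=["faq", "catalog"]`, every FAQ document precedes every catalog document, even when a catalog document has a higher score. Whether that trade-off is desirable depends on how much the collections differ in trustworthiness.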
Critical: NEVER query ChromaDB
This skill operates on already-retrieved documents. The Retrieval Agent has already computed distances and similarity scores. Ranking simply sorts and filters based on these existing metrics.
Implementation: Functions should be in components/ranker.py, similar to components/grader.py.
Examples
Example 1: Basic ranking by similarity
from components.ranker import rank_documents_by_similarity
# Input: Documents from Retrieval Agent
# Each has: document, metadata, distance, collection, similarity_score
documents = [
{'document': 'Laptop A...', 'similarity_score': 0.85, 'distance': 0.15},  # metadata, collection omitted
{'document': 'Laptop B...', 'similarity_score': 0.92, 'distance': 0.08},  # metadata, collection omitted
{'document': 'Laptop C...', 'similarity_score': 0.73, 'distance': 0.27},  # metadata, collection omitted
]
# Rank by similarity (descending)
ranked = rank_documents_by_similarity(documents)
# Output: [Laptop B (0.92), Laptop A (0.85), Laptop C (0.73)]
Example 2: Filter by threshold then rank
from components.ranker import filter_by_similarity_threshold, rank_documents_by_similarity
# Input: 15 documents from 3 collections (5 each), retrieved upstream by the Retrieval Agent
documents = retrieve_from_chromadb("gaming laptop", collections=["catalog", "faq", "troubleshooting"])
# Filter to only high-quality matches (similarity > 0.7)
high_quality = filter_by_similarity_threshold(documents, threshold=0.7)
# Reduced from 15 to 8 documents
# Rank the high-quality matches
ranked = rank_documents_by_similarity(high_quality)
# Output: 8 documents sorted by similarity, all > 0.7
Example 3: Combined ranking and grading workflow
from components.ranker import rank_documents_by_similarity, get_top_k_documents
from components.grader import grade_documents, filter_relevant_documents
# Step 1: Retrieve documents (done by Retrieval Agent)
retrieved_docs = await retrieval_agent.retrieve_documents("best laptop for video editing", top_k=5)
# Retrieved 15 documents (5 per collection)
# Step 2: Rank by similarity score
ranked_docs = rank_documents_by_similarity(retrieved_docs)
# Step 3: Take top 10 for grading (reduce cost)
top_docs = get_top_k_documents(ranked_docs, k=10)
# Step 4: Grade for binary relevance
graded_docs = grade_documents("best laptop for video editing", top_docs)
relevant_docs = filter_relevant_documents(graded_docs)
# Output: Only the most relevant documents (high similarity + graded as relevant)
Example 4: Ranking within collections
from components.ranker import rank_by_collection
# Input: Mixed documents from multiple collections, retrieved upstream by the Retrieval Agent
documents = retrieve_from_chromadb("laptop warranty", collections=["catalog", "faq"])
# Rank within each collection, then combine
ranked = rank_by_collection(documents)
# Output: {
# 'catalog': [doc1 (0.88), doc2 (0.75), doc3 (0.62)],
# 'faq': [doc4 (0.95), doc5 (0.91), doc6 (0.84)]
# }
# Use this to prioritize certain collections or balance results
Distance vs Similarity Score
When to use each:
- Similarity Score: Easier to understand, use for thresholds and display
  - "Keep documents with similarity > 0.7"
  - "Top document has 92% similarity"
- Distance: Direct from vector search, use for debugging
  - "ChromaDB returned distance of 0.08"
  - "Check if distance < 0.3 for high confidence"
Conversion:
similarity_score = 1 - distance
distance = 1 - similarity_score
Typical thresholds:
- similarity_score > 0.8: Very relevant
- similarity_score > 0.7: Relevant
- similarity_score > 0.5: Possibly relevant
- similarity_score < 0.5: Likely not relevant
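The threshold bands above can be expressed as a small lookup. This is an illustrative sketch: `relevance_label` is a hypothetical helper, and because the list does not say which band a score of exactly 0.5 belongs to, the sketch assumes it falls into the lowest one.

```python
def relevance_label(similarity_score: float) -> str:
    """Map a similarity score to the relevance bands listed above."""
    if similarity_score > 0.8:
        return "very relevant"
    if similarity_score > 0.7:
        return "relevant"
    if similarity_score > 0.5:
        return "possibly relevant"
    # Assumption: a score of exactly 0.5 is treated as "likely not relevant".
    return "likely not relevant"
```

These cut-offs are the document's rule-of-thumb values. Appropriate thresholds vary with the embedding model and the corpus, so they are worth checking against real queries.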
Integration with Grading
Ranking and grading serve different purposes:
- Ranking: Sorts documents by similarity score (continuous, typically 0-1)
  - Fast, cheap (no API calls)
  - Based on vector similarity alone
  - Use for initial filtering and prioritization
- Grading: Binary relevance with reasoning (yes/no)
  - Slower, costs tokens (Claude API)
  - Semantic understanding of relevance
  - Use for final filtering before generation
Recommended workflow:
- Retrieve documents (Retrieval Agent)
- Rank by similarity (Ranking skill)
- Take top-k to reduce grading cost
- Grade for binary relevance (Grading skill)
- Generate answer from relevant docs (Generator Agent)
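The cost argument behind steps 2 to 4 can be illustrated with a sketch in which the grader is passed in as a callable, so the number of grading calls is capped at k. `rank_then_grade`, `grade_fn`, and `keyword_grader` are all hypothetical stand-ins. The actual grader in `components/grader.py` calls the Claude API and is not reproduced here.

```python
from typing import Any, Callable

Document = dict[str, Any]


def rank_then_grade(
    query: str,
    documents: list[Document],
    grade_fn: Callable[[str, Document], bool],
    k: int = 10,
) -> list[Document]:
    """Rank by similarity, keep the top k, then grade only those k."""
    ranked = sorted(documents, key=lambda d: d["similarity_score"], reverse=True)
    top_docs = ranked[:k]
    # grade_fn is called at most k times, which bounds the grading cost.
    return [doc for doc in top_docs if grade_fn(query, doc)]


def keyword_grader(query: str, doc: Document) -> bool:
    """Toy stand-in for an LLM grader: true if any query word appears in the text."""
    text = doc["document"].lower()
    return any(word in text for word in query.lower().split())
```

A document outside the top k is never graded, even if the grader would have accepted it, so k trades recall against token cost.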