Solr Search Infrastructure
Overview
The AlphaSense Enterprise Insight platform powers document search through a Solr infrastructure stack comprising query routing, SolrCloud clusters, and indexing pipelines.
The se-solrcloud-router sits between doc-search-realtime and the Solr data nodes, routing all search queries to the appropriate Solr collections on se-solr-node. Writes into Solr flow through the indexer and solr-queue-indexer services, with bulk repairs handled by the reindexer.
The control plane is managed by solrcloud-operator, se-zookeeper-k8, se-solrcloud-configs-k8, and se-solrcloud-collection-manager.
This runbook applies when symptoms point to Solr latency, router saturation, indexing lag, or Solr collection and configuration problems. It outlines the common failure scenarios for the Solr infrastructure and provides troubleshooting steps for each.
Before Investigating: Quick Triage
Confirm the following within the first few minutes:
- Symptom type: Search UI latency (p95/p99 elevated) vs outright errors vs missing/stale documents
- Path affected: Read path (router/Solr query latency) or write path (indexing lag, missing docs)
- Scope: SaaS vs Private Cloud tenant; specific collections or all
- Starting point: Always check router health first — it is the most common choke point — then Solr node health
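For the first-pass health sweep, it can help to scan captured `kubectl get pods -o json` output for containers that are not ready or have restarted. A minimal sketch, assuming jq is available — the sample JSON below is fabricated for illustration, not real cluster output:

```shell
# List pods that are not ready or have restarted, from captured
# `kubectl get pods -o json` output. Requires jq.
flag_unhealthy_pods() {
  jq -r '.items[]
         | select((.status.containerStatuses // [])
                  | any(.ready == false or .restartCount > 0))
         | .metadata.name' "$1"
}

# Illustrative sample of captured output (fabricated data):
cat > /tmp/pods.json <<'EOF'
{"items":[
 {"metadata":{"name":"se-solrcloud-router-solr9-abc12"},
  "status":{"containerStatuses":[{"ready":false,"restartCount":3}]}},
 {"metadata":{"name":"se-solr-node-0"},
  "status":{"containerStatuses":[{"ready":true,"restartCount":0}]}}
]}
EOF
flag_unhealthy_pods /tmp/pods.json
```

In a live cluster you would feed this with `kubectl get pods -n search-engine -o json` instead of the sample file.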
Failure Scenarios
1. Search Is Slow (High p95/p99 Latency)
Triage:
Users report that search results are returning slowly, and p95/p99 latency is elevated above baseline. This is typically caused by router saturation (threads piling up and requests queuing) or by Solr query latency on the data nodes — particularly GC pressure or hot shards on specific Solr nodes.
Troubleshooting:
Check se-solrcloud-router pod health, resource usage, and restart history:
kubectl get pods -n search-engine -l app=se-solrcloud-router-solr9
kubectl logs -n search-engine -l app=se-solrcloud-router-solr9 --tail=200
Look for OOMKills, restarts, CPU or memory spikes, and signs of thread pile-ups or GC pressure.
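OOMKills are recorded in each container's `lastState.terminated.reason`, so they can be pulled out of captured pod JSON mechanically. A hedged sketch (the sample data is fabricated for illustration):

```shell
# Report pods whose containers were last terminated with OOMKilled,
# from captured `kubectl get pods -o json` output. Requires jq.
oomkilled_pods() {
  jq -r '.items[]
         | select((.status.containerStatuses // [])
                  | any(.lastState.terminated.reason == "OOMKilled"))
         | .metadata.name' "$1"
}

# Fabricated sample of captured output:
cat > /tmp/router-pods.json <<'EOF'
{"items":[
 {"metadata":{"name":"se-solrcloud-router-solr9-xyz99"},
  "status":{"containerStatuses":[
    {"ready":true,"restartCount":2,
     "lastState":{"terminated":{"reason":"OOMKilled","exitCode":137}}}]}},
 {"metadata":{"name":"se-solrcloud-router-solr9-ok000"},
  "status":{"containerStatuses":[
    {"ready":true,"restartCount":0,"lastState":{}}]}}
]}
EOF
oomkilled_pods /tmp/router-pods.json
```

A non-empty result here points at memory limits or GC pressure rather than request volume alone.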
Check Solr node health for pod readiness issues, restarts, and memory pressure:
kubectl get pods -n search-engine -l app=se-solr-node
kubectl logs -n search-engine -l app=se-solr-node --tail=200
Determine whether the slowness is isolated to specific search modes or collections (e.g., REGULAR, LYS, SEARCH_API) if that information is available.
As a low-impact mitigation, restarting the router deployment may clear thread pile-ups and remove the slowness:
kubectl rollout restart deployment/se-solrcloud-router-solr9 -n search-engine
If the Solr nodes are the bottleneck, do not take disruptive actions (shard moves, node restarts) without coordinating with the AS Support team.
Confirm resolution when router queueing drops and Solr query latency returns to baseline.
2. Search Errors / Timeouts (5xx, Upstream Timeouts)
Triage:
Users are receiving search errors, 5xx responses, or upstream timeouts.
This is typically caused by router pods crashlooping or becoming saturated, Zookeeper instability causing collection discovery failures, or Solr nodes rejecting requests.
Troubleshooting:
Check router pod readiness, restart counts, and recent deploys, then review recent logs for both router variants:
kubectl logs -n search-engine -l app=se-solrcloud-router-solr9 --tail=200
kubectl logs -n search-engine -l app=se-solrcloud-router-lys-solr9 --tail=200
Check Zookeeper pod health and logs for leader election churn or instability:
kubectl logs -n search-engine -l app=zookeeper --tail=200
If Zookeeper is unhealthy, treat this as high severity and escalate to the AS Support team immediately. Do not attempt to resolve Zookeeper issues without AS Support team alignment.
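Leader election churn usually shows up as repeated peer-state transitions in the Zookeeper logs. A minimal sketch that counts such transitions in a captured log file — the log lines and grep patterns below are illustrative, and actual Zookeeper log formats vary by version:

```shell
# Count leader-election state transitions in a captured Zookeeper log.
# Patterns are illustrative; adjust to your ZK version's log format.
zk_election_events() {
  grep -cE 'LOOKING|LEADING|FOLLOWING' "$1"
}

# Fabricated sample log:
cat > /tmp/zk.log <<'EOF'
2024-05-01 10:00:01 INFO  QuorumPeer - PeerState set to LOOKING
2024-05-01 10:00:02 INFO  QuorumPeer - PeerState set to FOLLOWING
2024-05-01 10:05:14 INFO  QuorumPeer - PeerState set to LOOKING
2024-05-01 10:05:15 INFO  QuorumPeer - PeerState set to LEADING
2024-05-01 10:06:00 INFO  NIOServerCnxn - Established session
EOF
zk_election_events /tmp/zk.log
```

A count that climbs steadily over a short window — especially repeated LOOKING transitions — suggests election churn and supports immediate escalation.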
3. Missing or Stale Documents (Indexing Issues)
Triage:
Users report that documents expected to appear in search results are missing, or that search results do not reflect recent content changes. Before investigating, determine whether this is a read-path issue (documents exist in Solr but are not being returned due to filters or routing) or a write-path issue (documents were never indexed or indexing has fallen behind).
Troubleshooting:
3.1 User doc sharing visibility mismatch (DocAccessShareeIds / AccessShareeIds appear wrong)
Symptom
- A user-generated document appears to be indexed with sharee IDs (or annotation access IDs), but those sharee IDs cannot be validated via the new authorization platform / entitlement properties.
Working hypothesis
- Likely data desynchronization between the legacy useractivity-storage (DynamoDB-backed, later migrated to Scylla) and the current activity-storage (MongoDB-backed).
- Some documents (especially older ones) may have been indexed using legacy activity/share data that is not reflected in activity-storage today.
Why this matters / what to check
- The indexer populates user-doc access fields based on activity-storage activities (EntityAccess / AutoShare) and does not call entitlement properties for user docs.
- Therefore, if the sharee list is wrong in Solr, focus on the activity storage sources and the indexing inputs, not entitlement-ng.
Solr check (is Solr holding the “wrong” sharee IDs?)
- Port-forward to a Solr node:
kubectl -n search-engine port-forward se-solr-node-0 8983
- Query the document and inspect access fields:
curl 'http://localhost:8983/solr/searching_uc_private_alias/query' \
  -H 'Content-Type: application/json' \
  -d '{
    "query": "*:*",
    "filter": "DocumentId:{YOUR_DOCUMENT_ID}",
    "fields": ["DocumentId","SourceUserId","DocAccessShareeIds","AccessShareeIds"],
    "limit": 10
  }'
Field reference:
- DocAccessShareeIds: document-level access for user docs
- AccessShareeIds: sub-resource (annotation) access
- SourceUserId: document creator
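To make the Solr side of the comparison mechanical, the access fields can be flattened into a sorted, newline-separated list with jq. A sketch, assuming the response shape above — the sample response and IDs are fabricated for illustration:

```shell
# Flatten DocAccessShareeIds from a saved Solr /query response into a
# sorted, newline-separated list, ready for diffing. Requires jq.
solr_sharees() {
  jq -r '.response.docs[0].DocAccessShareeIds[]?' "$1" | sort
}

# Fabricated sample response:
cat > /tmp/solr-doc.json <<'EOF'
{"response":{"numFound":1,"docs":[
 {"DocumentId":"doc-123","SourceUserId":"u-1",
  "DocAccessShareeIds":["u-9","u-2","u-5"],
  "AccessShareeIds":["u-2"]}]}}
EOF
solr_sharees /tmp/solr-doc.json
```

The same flattening can be applied per document when collecting evidence across a batch of affected IDs.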
MongoDB check (what does activity-storage say for sharees?)
- Fetch the Mongo credentials (the cluster-specific secret name may differ by environment):
kubectl get secret -n platform activity-storage-processing-mongodb-mongodb -o json | jq .data
- Port-forward to the Mongo primary (replace {PRIMARY_INSTANCE_ID}):
kubectl port-forward -n mongodb shared-mongodb-{PRIMARY_INSTANCE_ID} 27017
- Connect and authenticate:
mongosh activity
db.auth("{USERNAME}", "{PASSWORD}")
- Inspect the EntityAccess activity for the document:
db.activityDoc.find({
  'parents.type': 'Document',
  'parents.entityId': '{YOUR_DOCUMENT_ID}',
  activityType: 'EntityAccess',
})
Decision / next steps
- If the Solr sharee IDs ≠ the Mongo sharee IDs for the same document, treat this as an index input/data-source mismatch (likely legacy vs current activity storage).
- Collect a list of affected document IDs by exporting the Mongo sharee data and comparing it to the Solr output for those docs.
- Escalate to the owning team for activity-storage / indexer with:
  - Document IDs
  - Solr fields (DocAccessShareeIds, AccessShareeIds, SourceUserId)
  - Mongo EntityAccess evidence
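Once both sides are exported as sorted ID lists (one per line), `comm` surfaces the mismatch directly. A sketch with fabricated IDs:

```shell
# Compare sorted sharee-ID lists exported from Solr and Mongo for one
# document. comm column 1 = only in Solr, column 2 = only in Mongo.
sharee_diff() {
  comm -3 "$1" "$2"
}

printf 'u-2\nu-5\nu-9\n' > /tmp/solr-sharees.txt   # from the Solr query
printf 'u-2\nu-5\n'      > /tmp/mongo-sharees.txt  # from EntityAccess
sharee_diff /tmp/solr-sharees.txt /tmp/mongo-sharees.txt
```

A non-empty diff for a document is exactly the evidence worth attaching when escalating. Note that `comm` requires both inputs to be sorted.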
(Source: Citadel investigation on missing document sharing information.)
First, confirm whether the document exists in the Solr index. Port-forward to se-solr-node.search-engine, then call the following URL, replacing <port-number> and <document-id>:
http://localhost:<port-number>/solr/searching_uc_private_alias/select?q=*:*&rows=1&fq=DocumentId:<document-id>
If the document is not present, the issue is in the write path.
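This presence check can be scripted against the saved response: a `numFound` greater than 0 means the document is indexed (read-path problem), while 0 means it never made it in (write path). A sketch with a fabricated response:

```shell
# Decide read path vs write path from a saved Solr select response.
doc_in_solr() {
  if jq -e '.response.numFound > 0' "$1" > /dev/null; then
    echo "indexed: investigate read path"
  else
    echo "missing: investigate write path"
  fi
}

# Fabricated sample response for a document that was never indexed:
cat > /tmp/select-response.json <<'EOF'
{"response":{"numFound":0,"docs":[]}}
EOF
doc_in_solr /tmp/select-response.json
```

When checking many document IDs, the same function can be looped over a batch of saved responses.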
Check indexer pod health and logs for errors converting document metadata:
# for internal docs
kubectl logs -n search-engine -l app=indexer-private-solr9
# for public docs
kubectl logs -n search-engine -l app=indexer-solr9
Check solr-queue-indexer for queue consumer lag and errors writing to Solr:
# for internal docs
kubectl logs -n search-engine -l app=solr-queue-indexer-usercontent-solr9
# for public docs
kubectl logs -n search-engine -l app=solr-queue-indexer-solr9
If a bulk repair or reindex was recently triggered, check the reindexer status:
kubectl logs -n search-engine -l app=reindexer-solr9 --tail=200
If the indexer is producing conversion errors, identify the failing document IDs and coordinate with the owning team to resolve the data issues.
If queue indexing lag is the cause, scale consumers in alignment with Search Infra or resolve the downstream Solr bottleneck causing write pressure.
If the document exists in Solr but is not returned in search results, the issue is on the read path — refer to the Runbook for Document Search and Entitlement Flow for entitlement and routing checks.