Solr Search Infrastructure
Overview
The AlphaSense Enterprise Insight platform powers document search through a Solr infrastructure stack comprising query routing, SolrCloud clusters, and indexing pipelines.
The se-solrcloud-router sits between doc-search-realtime and the Solr data nodes, routing all search queries to the appropriate Solr collections on se-solr-node. Writes into Solr flow through the indexer and solr-queue-indexer services, with bulk repairs handled by the reindexer.
The control plane is managed by solrcloud-operator, se-zookeeper-k8, se-solrcloud-configs-k8, and se-solrcloud-collection-manager.
This runbook applies when symptoms point to Solr latency, router saturation, indexing lag, or Solr collection and configuration problems. It outlines the common failure scenarios for the Solr infrastructure and provides troubleshooting steps for each.
Before Investigating: Quick Triage
Confirm the following within the first few minutes:
- Symptom type: Search UI latency (p95/p99 elevated) vs outright errors vs missing/stale documents
- Path affected: Read path (router/Solr query latency) or write path (indexing lag, missing docs)
- Scope: SaaS vs Private Cloud tenant; specific collections or all
- Starting point: Always check router health first — it is the most common choke point — then Solr node health
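For the first-pass health sweep, it can help to scan captured `kubectl get pods -o json` output for containers that are not ready or have restarted. A minimal sketch, assuming jq is available — the sample JSON below is fabricated for illustration, not real cluster output:

```shell
# List pods that are not ready or have restarted, from captured
# `kubectl get pods -o json` output. Requires jq.
flag_unhealthy_pods() {
  jq -r '.items[]
         | select((.status.containerStatuses // [])
                  | any(.ready == false or .restartCount > 0))
         | .metadata.name' "$1"
}

# Illustrative sample of captured output (fabricated data):
cat > /tmp/pods.json <<'EOF'
{"items":[
 {"metadata":{"name":"se-solrcloud-router-solr9-abc12"},
  "status":{"containerStatuses":[{"ready":false,"restartCount":3}]}},
 {"metadata":{"name":"se-solr-node-0"},
  "status":{"containerStatuses":[{"ready":true,"restartCount":0}]}}
]}
EOF
flag_unhealthy_pods /tmp/pods.json
```

In a live cluster you would feed this with `kubectl get pods -n search-engine -o json` instead of the sample file.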
Failure Scenarios
1. Search Is Slow (High p95/p99 Latency)
Triage:
Users report that search results are returning slowly, and p95/p99 latency is elevated above baseline. This is typically caused by router saturation (threads piling up and requests queuing) or by Solr query latency on the data nodes — particularly GC pressure or hot shards on specific Solr nodes.
Troubleshooting:
Check se-solrcloud-router pod health, resource usage, and restart history:
kubectl get pods -n search-engine -l app=se-solrcloud-router-solr9
kubectl logs -n search-engine -l app=se-solrcloud-router-solr9 --tail=200
Look for OOMKills, restarts, CPU or memory spikes, and signs of thread pile-ups or GC pressure.
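OOMKills are recorded in each container's `lastState.terminated.reason`, so they can be pulled out of captured pod JSON mechanically. A hedged sketch (the sample data is fabricated for illustration):

```shell
# Report pods whose containers were last terminated with OOMKilled,
# from captured `kubectl get pods -o json` output. Requires jq.
oomkilled_pods() {
  jq -r '.items[]
         | select((.status.containerStatuses // [])
                  | any(.lastState.terminated.reason == "OOMKilled"))
         | .metadata.name' "$1"
}

# Fabricated sample of captured output:
cat > /tmp/router-pods.json <<'EOF'
{"items":[
 {"metadata":{"name":"se-solrcloud-router-solr9-xyz99"},
  "status":{"containerStatuses":[
    {"ready":true,"restartCount":2,
     "lastState":{"terminated":{"reason":"OOMKilled","exitCode":137}}}]}},
 {"metadata":{"name":"se-solrcloud-router-solr9-ok000"},
  "status":{"containerStatuses":[
    {"ready":true,"restartCount":0,"lastState":{}}]}}
]}
EOF
oomkilled_pods /tmp/router-pods.json
```

A non-empty result here points at memory limits or GC pressure rather than request volume alone.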
Check Solr node health for pod readiness issues, restarts, and memory pressure:
kubectl get pods -n search-engine -l app=se-solr-node
kubectl logs -n search-engine -l app=se-solr-node --tail=200
Determine whether the slowness is isolated to specific search modes or collections (e.g., REGULAR, LYS, SEARCH_API) if that information is available.
As a low-impact mitigation, restarting the router deployment may clear thread pile-ups and remove the slowness:
kubectl rollout restart deployment/se-solrcloud-router-solr9 -n search-engine
If the Solr nodes are the bottleneck, do not take disruptive actions (shard moves, node restarts) without coordinating with the AS Support team.
Confirm resolution when router queueing drops and Solr query latency returns to baseline.
2. Search Errors / Timeouts (5xx, Upstream Timeouts)
Triage:
Users are receiving search errors, 5xx responses, or upstream timeouts.
This is typically caused by router pods crashlooping or becoming saturated, Zookeeper instability causing collection discovery failures, or Solr nodes rejecting requests.
Troubleshooting:
Check router pod readiness, restart counts, and recent deploys, then review recent logs for both router variants:
kubectl logs -n search-engine -l app=se-solrcloud-router-solr9 --tail=200
kubectl logs -n search-engine -l app=se-solrcloud-router-lys-solr9 --tail=200
Check Zookeeper pod health and logs for leader election churn or instability:
kubectl logs -n search-engine -l app=zookeeper --tail=200
If Zookeeper is unhealthy, treat this as high severity and escalate to the AS Support team immediately. Do not attempt to resolve Zookeeper issues without AS Support team alignment.
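Leader election churn usually shows up as repeated peer-state transitions in the Zookeeper logs. A minimal sketch that counts such transitions in a captured log file — the log lines and grep patterns below are illustrative, and actual Zookeeper log formats vary by version:

```shell
# Count leader-election state transitions in a captured Zookeeper log.
# Patterns are illustrative; adjust to your ZK version's log format.
zk_election_events() {
  grep -cE 'LOOKING|LEADING|FOLLOWING' "$1"
}

# Fabricated sample log:
cat > /tmp/zk.log <<'EOF'
2024-05-01 10:00:01 INFO  QuorumPeer - PeerState set to LOOKING
2024-05-01 10:00:02 INFO  QuorumPeer - PeerState set to FOLLOWING
2024-05-01 10:05:14 INFO  QuorumPeer - PeerState set to LOOKING
2024-05-01 10:05:15 INFO  QuorumPeer - PeerState set to LEADING
2024-05-01 10:06:00 INFO  NIOServerCnxn - Established session
EOF
zk_election_events /tmp/zk.log
```

A count that climbs steadily over a short window — especially repeated LOOKING transitions — suggests election churn and supports immediate escalation.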
3. Missing or Stale Documents (Indexing Issues)
Triage:
Users report that documents expected to appear in search results are missing, or that search results do not reflect recent content changes. Before investigating, determine whether this is a read-path issue (documents exist in Solr but are not being returned due to filters or routing) or a write-path issue (documents were never indexed or indexing has fallen behind).
Troubleshooting:
3.1 User doc sharing visibility mismatch (DocAccessShareeIds / AccessShareeIds appear wrong)
Symptom
- A user-generated document appears to be indexed with sharee IDs (or annotation access IDs), but those sharee IDs cannot be validated via the new authorization platform / entitlement properties.
Working hypothesis
- Likely data desynchronization between the legacy useractivity-storage (DynamoDB-backed, later migrated to Scylla) and the current activity-storage (MongoDB-backed).
- Some documents (especially older ones) may have been indexed using legacy activity/share data that is not reflected in activity-storage today.
Why this matters / what to check
- The indexer populates user-doc access fields based on activity-storage activities (EntityAccess / AutoShare) and does not call entitlement properties for user docs.
- Therefore, if the sharee list is wrong in Solr, focus on the activity storage sources and the indexing inputs, not entitlement-ng.
Solr check (is Solr holding the “wrong” sharee IDs?)
- Port-forward to a Solr node:
kubectl -n search-engine port-forward se-solr-node-0 8983
- Query the document and inspect access fields:
curl 'http://localhost:8983/solr/searching_uc_private_alias/query' \
  -H 'Content-Type: application/json' \
  -d '{
    "query": "*:*",
    "filter": "DocumentId:{YOUR_DOCUMENT_ID}",
    "fields": ["DocumentId","SourceUserId","DocAccessShareeIds","AccessShareeIds"],
    "limit": 10
  }'
Field reference:
- DocAccessShareeIds: document-level access for user docs
- AccessShareeIds: sub-resource (annotation) access
- SourceUserId: document creator
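To make the Solr side of the comparison mechanical, the access fields can be flattened into a sorted, newline-separated list with jq. A sketch, assuming the response shape above — the sample response and IDs are fabricated for illustration:

```shell
# Flatten DocAccessShareeIds from a saved Solr /query response into a
# sorted, newline-separated list, ready for diffing. Requires jq.
solr_sharees() {
  jq -r '.response.docs[0].DocAccessShareeIds[]?' "$1" | sort
}

# Fabricated sample response:
cat > /tmp/solr-doc.json <<'EOF'
{"response":{"numFound":1,"docs":[
 {"DocumentId":"doc-123","SourceUserId":"u-1",
  "DocAccessShareeIds":["u-9","u-2","u-5"],
  "AccessShareeIds":["u-2"]}]}}
EOF
solr_sharees /tmp/solr-doc.json
```

The same flattening can be applied per document when collecting evidence across a batch of affected IDs.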
MongoDB check (what does activity-storage say for sharees?)
- Fetch the Mongo credentials (the cluster-specific secret name may differ by environment):
kubectl get secret -n platform activity-storage-processing-mongodb-mongodb -o json | jq .data
- Port-forward to the Mongo primary (replace {PRIMARY_INSTANCE_ID}):
kubectl port-forward -n mongodb shared-mongodb-{PRIMARY_INSTANCE_ID} 27017
- Connect and authenticate:
mongosh activity
db.auth("{USERNAME}", "{PASSWORD}")
- Inspect the EntityAccess activity for the document:
db.activityDoc.find({
  'parents.type': 'Document',
  'parents.entityId': '{YOUR_DOCUMENT_ID}',
  activityType: 'EntityAccess',
})
Decision / next steps
- If the Solr sharee IDs ≠ the Mongo sharee IDs for the same document, treat this as an index input/data-source mismatch (likely legacy vs current activity storage).
- Collect a list of affected document IDs by exporting the Mongo sharee data and comparing it to the Solr output for those docs.
- Escalate to the owning team for activity-storage / indexer with:
  - Document IDs
  - Solr fields (DocAccessShareeIds, AccessShareeIds, SourceUserId)
  - Mongo EntityAccess evidence
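Once both sides are exported as sorted ID lists (one per line), `comm` surfaces the mismatch directly. A sketch with fabricated IDs:

```shell
# Compare sorted sharee-ID lists exported from Solr and Mongo for one
# document. comm column 1 = only in Solr, column 2 = only in Mongo.
sharee_diff() {
  comm -3 "$1" "$2"
}

printf 'u-2\nu-5\nu-9\n' > /tmp/solr-sharees.txt   # from the Solr query
printf 'u-2\nu-5\n'      > /tmp/mongo-sharees.txt  # from EntityAccess
sharee_diff /tmp/solr-sharees.txt /tmp/mongo-sharees.txt
```

A non-empty diff for a document is exactly the evidence worth attaching when escalating. Note that `comm` requires both inputs to be sorted.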
(Source: Citadel investigation on missing document sharing information.)
First, confirm whether the document exists in the Solr index. Port-forward to se-solr-node.search-engine, then call the following URL, replacing <port-number> and <document-id>:
http://localhost:<port-number>/solr/searching_uc_private_alias/select?q=*:*&rows=1&fq=DocumentId:<document-id>
If the document is not present, the issue is in the write path.
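This presence check can be scripted against the saved response: a `numFound` greater than 0 means the document is indexed (read-path problem), while 0 means it never made it in (write path). A sketch with a fabricated response:

```shell
# Decide read path vs write path from a saved Solr select response.
doc_in_solr() {
  if jq -e '.response.numFound > 0' "$1" > /dev/null; then
    echo "indexed: investigate read path"
  else
    echo "missing: investigate write path"
  fi
}

# Fabricated sample response for a document that was never indexed:
cat > /tmp/select-response.json <<'EOF'
{"response":{"numFound":0,"docs":[]}}
EOF
doc_in_solr /tmp/select-response.json
```

When checking many document IDs, the same function can be looped over a batch of saved responses.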
Check indexer pod health and logs for errors converting document metadata:
# for internal docs
kubectl logs -n search-engine -l app=indexer-private-solr9
# for public docs
kubectl logs -n search-engine -l app=indexer-solr9
Check solr-queue-indexer for queue consumer lag and errors writing to Solr:
# for internal docs
kubectl logs -n search-engine -l app=solr-queue-indexer-usercontent-solr9
# for public docs
kubectl logs -n search-engine -l app=solr-queue-indexer-solr9
If a bulk repair or reindex was recently triggered, check the reindexer status:
kubectl logs -n search-engine -l app=reindexer-solr9 --tail=200
If the indexer is producing conversion errors, identify the failing document IDs and coordinate with the owning team to resolve the data issues.
If queue indexing lag is the cause, scale consumers in alignment with Search Infra or resolve the downstream Solr bottleneck causing write pressure.
If the document exists in Solr but is not returned in search results, the issue is on the read path — refer to the Runbook for Document Search and Entitlement Flow for entitlement and routing checks.