Version: v2.4.1

Search Slowness Troubleshooting

Overview

The AlphaSense Enterprise Insight platform in Private Cloud routes search requests across two distinct paths: searches against locally indexed private content handled by the Solr private cloud installation, and searches against public content routed through the AlphaSense mothership.

When users report search slowness, the first step is to determine which path is responsible before escalating to AlphaSense Support.

This document outlines how to diagnose and isolate search slowness and provides troubleshooting steps for resolution.

Failure Scenarios

1. Search Is Slow Across All Content Types

Triage:

Users report that search results are taking an unusually long time to return, but it is not clear whether the slowness affects private content, public content, or both. Before escalating, isolate which search path is responsible to avoid unnecessary investigation of healthy systems.

Troubleshooting:

Ask affected users to run two separate searches using the UI filters so we can isolate which backend path is slow:

Private / local content only — exercises only the private cloud Solr path.
Public content only — exercises the connection to the AlphaSense mothership.

Alternative: query directly at the router level

You can isolate the path without involving end users by running sample queries directly at the se-solrcloud-router level.

Open a port-forward to one of the se-solrcloud-router-solr9 pods:

kubectl port-forward se-solrcloud-router-solr9-0 8983 -n search-engine

Then open the following Solr Admin UI query URLs in a browser, based on content type:

Private content: http://localhost:8983/solr/#/all/query?q=apple sales&q.op=OR&indent=true&_target_=uc_private&_targetroute_.uc_private=AcceptanceDate:1713969025000-1777041025000!SourceCompanyId:1&useParams=
Public content: http://localhost:8983/solr/#/all/query?q=apple sales&q.op=OR&indent=true&_target_=public&_targetroute_.public=AcceptanceDate:1713969025000-1777041025000!CodeLevel1:17,18,19,27,30,38&_routerKeys_=mothership&useParams=

The queries above search for content between Apr 2024 and Apr 2026. To change the range, update the two epoch timestamps (in milliseconds) in AcceptanceDate:1713969025000-1777041025000, for example with an Epoch Converter.
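
As an alternative to a web-based converter, the millisecond timestamps can be computed in a shell. This is a minimal sketch that assumes GNU date (the -d flag behaves differently on macOS/BSD). The helper names to_epoch_ms and acceptance_range are hypothetical and not part of any AlphaSense tooling.

```shell
#!/bin/sh
# Convert a UTC date/time string to epoch milliseconds.
# Assumes GNU date; BSD/macOS date uses a different syntax.
to_epoch_ms() {
  # date prints whole seconds; appending 000 yields milliseconds.
  printf '%s000\n' "$(date -u -d "$1" +%s)"
}

# Build the AcceptanceDate range used in the _targetroute_ parameter.
acceptance_range() {
  printf 'AcceptanceDate:%s-%s\n' "$(to_epoch_ms "$1")" "$(to_epoch_ms "$2")"
}
```

For example, acceptance_range '2024-04-24 14:30:25 UTC' '2026-04-24 14:30:25 UTC' reproduces the range used in the sample queries.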

How to interpret results:

Only one search is slow: The issue is isolated to that specific path. Continue with the relevant scenario below.
Both searches are slow: Start by investigating private search, since it is fully within the private cloud environment and is the most directly actionable.
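
The interpretation rules above can also be applied to timings collected with curl through the port-forward. The sketch below is illustrative only: it assumes the router exposes the standard /select request handler on the all collection (the handler the Admin UI query screen uses by default), and the 10000 ms threshold and the function names are hypothetical choices rather than AlphaSense-defined values.

```shell
#!/bin/sh
# Time one HTTP request and print the total time in whole milliseconds.
time_query_ms() {
  curl -s -o /dev/null --max-time 120 -w '%{time_total}\n' "$1" |
    awk '{ printf "%.0f\n", $1 * 1000 }'
}

# Apply the interpretation rules to two timings (both in milliseconds).
# The 10000 ms default threshold is an arbitrary illustration.
classify_slowness() (
  private_ms=$1
  public_ms=$2
  threshold_ms=${3:-10000}
  if [ "$private_ms" -ge "$threshold_ms" ] && [ "$public_ms" -ge "$threshold_ms" ]; then
    echo "both slow: investigate private search first"
  elif [ "$private_ms" -ge "$threshold_ms" ]; then
    echo "private path slow"
  elif [ "$public_ms" -ge "$threshold_ms" ]; then
    echo "public path slow"
  else
    echo "neither path slow"
  fi
)

# Example (requires the port-forward; the /select handler is an assumption):
# BASE='http://localhost:8983/solr/all/select?q=apple%20sales&q.op=OR'
# private_ms=$(time_query_ms "$BASE&_target_=uc_private&_targetroute_.uc_private=AcceptanceDate:1713969025000-1777041025000!SourceCompanyId:1")
# public_ms=$(time_query_ms "$BASE&_target_=public&_targetroute_.public=AcceptanceDate:1713969025000-1777041025000!CodeLevel1:17,18,19,27,30,38&_routerKeys_=mothership")
# classify_slowness "$private_ms" "$public_ms"
```

Running each query a few times and comparing the slowest run gives a steadier picture than a single request.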

Important note: shared routers can mask the root cause

Public and private searches share the same router layer. Latency from one path can consume router threads and increase latency for the other path.

If private-only searches are reported as slow even when Solr data-node graphs do not show latency, treat this as a potential router saturation issue.

Mitigation (recommended): restart routers

If symptoms suggest router saturation, restart the se-solrcloud-router-solr9 routers:

kubectl rollout restart deployment/se-solrcloud-router-solr9 -n search-engine

Routers are stateless, so a restart is a safe first mitigation step. If routers are overwhelmed, everything can appear slow regardless of what Solr data-node metrics show.
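
After triggering the restart, it helps to wait until the new router pods are ready before re-testing search. A minimal sketch, assuming the routers are managed as the deployment named in the command above; restart_routers is a hypothetical helper name and the 300s timeout is an arbitrary choice.

```shell
#!/bin/sh
# Restart the routers and block until the rollout completes or times out.
# The resource kind and name are taken from the restart command in this runbook.
restart_routers() (
  ns=${1:-search-engine}
  target=deployment/se-solrcloud-router-solr9
  kubectl rollout restart "$target" -n "$ns" || exit 1
  kubectl rollout status "$target" -n "$ns" --timeout=300s
)
```

A non-zero exit status from restart_routers means the rollout did not finish within the timeout and should be inspected before retrying searches.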

2. Private Search Is Slow (Local Solr)

Triage:

Search against locally indexed private content is slow. This indicates the issue is within the private cloud Solr installation. Consult AlphaSense Support as needed to help with the resolution.

Troubleshooting:

Use telemetry graphs to pinpoint where latency is coming from

Use the private cloud-hosted Google Cloud graphs (router + Solr data node telemetry) to identify whether the latency is coming from the router layer or from specific Solr shards.

1) Check router endpoint latency

Review latency metrics for router endpoints in your internal graphs, such as all_LYS and all.

The latency reported by the routers includes both private and public searches. Combined with step 2 below, you can rule out the private SolrCloud data nodes if no latency is reported there — leaving the private solrcloud-routers and/or public searches as the likely culprits.

2) Check shard-level latency (private SolrCloud)

Review latency metrics in your internal graphs for individual Solr shards that are local to the private cloud environment.

3) Determine the pattern

Classify the issue based on what you see:

Slowness concentrated on a specific shard — usually points to a node-level resource issue impacting that shard. Consult AlphaSense Support for guidance.
Uniform slowness across all shards — typically suggests a broader query load or infrastructure-wide problem.

Additional considerations (common causes and mitigations)

1) Disk near full on one or more data nodes (~80%+)

Check disk usage across all Solr data nodes:

for pod in $(kubectl get pods -n search-engine -o name | grep se-solr-node); do
  echo "=== $pod ==="
  kubectl exec -n search-engine "$pod" -- df -h /opt/solr-9.3.0/server/data
done
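
On clusters with many data nodes, the loop above can be narrowed to print only the pods at or above the threshold. This is a sketch under stated assumptions: the search-engine namespace default and the helper names disk_pct and report_full_nodes are illustrative, and df -P is used instead of df -h so that the output is easy to parse.

```shell
#!/bin/sh
THRESHOLD=${THRESHOLD:-80}

# Extract the use percentage (without the % sign) from `df -P` output on stdin.
disk_pct() {
  awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Print only the pods whose Solr data volume is at or above the threshold.
report_full_nodes() (
  ns=${1:-search-engine}
  for pod in $(kubectl get pods -n "$ns" -o name | grep se-solr-node); do
    pct=$(kubectl exec -n "$ns" "$pod" -- df -P /opt/solr-9.3.0/server/data | disk_pct)
    if [ -n "$pct" ] && [ "$pct" -ge "$THRESHOLD" ]; then
      echo "$pod ${pct}% (>= ${THRESHOLD}%)"
    fi
  done
)
```

Calling report_full_nodes with no arguments checks the search-engine namespace; an empty result means no data node has reached the threshold.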

If any pod shows 80%+ disk utilization (e.g. se-solr-node-0), expand its PVC. First confirm the PVC name and current size:

kubectl get pod se-solr-node-0 -n search-engine -o jsonpath='{.spec.volumes[*].persistentVolumeClaim.claimName}' | xargs -n1 kubectl get pvc -n search-engine

The output shows the PVC name in the NAME column and the current capacity in the CAPACITY column (e.g. 2000Gi). Increase it by 500Gi by patching that PVC with the new total size:

kubectl patch persistentvolumeclaim/data-se-solr-node-0 -n search-engine -p '{"spec":{"resources":{"requests":{"storage":"2500Gi"}}}}'
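
Because the patch takes the new total size rather than the increment, the arithmetic can be scripted to avoid mistakes. A minimal sketch; grow_capacity and pvc_patch_body are hypothetical helper names, only the Gi unit is handled, and PVC_NAME stands for the claim name reported by the previous command.

```shell
#!/bin/sh
# Add a number of Gi to a capacity string such as "2000Gi".
# Only the Gi unit is handled in this sketch.
grow_capacity() (
  current=$1
  increment_gi=${2:-500}
  case "$current" in
    *Gi) echo "$(( ${current%Gi} + $increment_gi ))Gi" ;;
    *)   echo "unsupported unit: $current" >&2; exit 1 ;;
  esac
)

# Build the JSON patch body for the new storage request.
pvc_patch_body() {
  printf '{"spec":{"resources":{"requests":{"storage":"%s"}}}}\n' "$1"
}

# Example (PVC_NAME is a placeholder for the claim name found earlier):
# kubectl patch "persistentvolumeclaim/$PVC_NAME" -n search-engine \
#   -p "$(pvc_patch_body "$(grow_capacity 2000Gi 500)")"
```

Note that Kubernetes only honors the larger request when the PVC's StorageClass allows volume expansion.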

2) A Solr data node has consistently high CPU

Use the solr-cpu-usage.sh script to check CPU utilization across all nodes:

solr-cpu-usage.sh
#!/bin/bash
# Shows CPU usage vs limits for se-solr-node pods
# Usage: ./solr-cpu-usage.sh [--watch POD_NAME] [--interval SECONDS] [--namespace NAMESPACE]

WATCH_POD=""
INTERVAL=5
NAMESPACE=""

while [[ $# -gt 0 ]]; do
  case "$1" in
    --watch)
      WATCH_POD="$2"
      shift 2
      ;;
    --interval)
      INTERVAL="$2"
      shift 2
      ;;
    -n|--namespace)
      NAMESPACE="$2"
      shift 2
      ;;
    *)
      shift
      ;;
  esac
done

NS_FLAG=""
if [[ -n "$NAMESPACE" ]]; then
  NS_FLAG="-n $NAMESPACE"
fi

print_header() {
  printf "%-30s %-12s %-12s %-8s\n" "POD" "USED" "LIMIT" "CPU%"
  printf "%-30s %-12s %-12s %-8s\n" "---" "----" "-----" "----"
}

print_pod() {
  local pod="$1"
  local usage limit limit_m pct
  usage=$(kubectl top pod $NS_FLAG "$pod" --no-headers 2>/dev/null | awk '{print $2}' | sed 's/m//')
  if [[ -z "$usage" ]]; then
    printf "%-30s %-12s\n" "$pod" "N/A"
    return
  fi
  limit=$(kubectl get pod $NS_FLAG "$pod" -o jsonpath='{.spec.containers[0].resources.limits.cpu}')
  if [[ -z "$limit" ]]; then
    printf "%-30s %-12s %-12s %-8s\n" "$pod" "${usage}m" "no limit" "N/A"
    return
  fi
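  if [[ "$limit" == *m ]]; then
    limit_m=${limit%m}
  else
    # Limit is in cores and may be fractional (e.g. "1.5"); bash arithmetic is
    # integer-only, so convert to millicores with awk.
    limit_m=$(awk -v cores="$limit" 'BEGIN { printf "%.0f", cores * 1000 }')
  fi
  pct=$((usage * 100 / limit_m))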
  printf "%-30s %-12s %-12s %-8s\n" "$pod" "${usage}m" "${limit_m}m" "${pct}%"
}

if [[ -n "$WATCH_POD" ]]; then
  echo "Watching $WATCH_POD every ${INTERVAL}s (Ctrl+C to stop)"
  echo ""
  while true; do
    clear
    echo "$(date '+%Y-%m-%d %H:%M:%S') — watching $WATCH_POD (every ${INTERVAL}s)"
    echo ""
    print_header
    print_pod "$WATCH_POD"
    sleep "$INTERVAL"
  done
else
  print_header
  for pod in $(kubectl get pods $NS_FLAG -o name | grep se-solr-node | sed 's|pod/||'); do
    print_pod "$pod"
  done
fi

Run the script to display CPU utilization per node:

./solr-cpu-usage.sh

To monitor a specific node continuously (refresh every 5 seconds):

./solr-cpu-usage.sh --watch se-solr-node-5 --interval 5

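Restart any node showing sustained high CPU usage (e.g. 90%+). kubectl rollout restart does not operate on individual pods, so delete the pod and let its controller recreate it:

kubectl delete pod se-solr-node-5 -n search-engine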

3) Shards have grown large and were never split

Open a port-forward to one of the se-solr-node pods:

kubectl port-forward se-solr-node-0 8983 -n search-engine

solr-capacity-planning.sh
#!/bin/bash
# Solr Capacity Planning Script
# Fetches total document count and index size across all nodes and collections
# Usage: ./solr-capacity-planning.sh [solr-host] [--over-size GB] [--node-pattern REGEX]
# Example: ./solr-capacity-planning.sh se-solr-node.search-engine.svc.rc.local --over-size 50
# Example: ./solr-capacity-planning.sh --node-pattern 'se-solr-node-hot-[0-9]'

SOLR_HOST=""
MIN_SIZE_GB=0
NODE_PATTERN="se-solr-node-[0-9]"

while [[ $# -gt 0 ]]; do
  case "$1" in
    --over-size)
      MIN_SIZE_GB="$2"
      shift 2
      ;;
    --node-pattern)
      NODE_PATTERN="$2"
      shift 2
      ;;
    *)
      SOLR_HOST="$1"
      shift
      ;;
  esac
done

SOLR_HOST="${SOLR_HOST:-localhost:8983}"
SOLR="http://$SOLR_HOST"

echo "Connecting to Solr at: $SOLR"
echo ""

# Step 1: Get all live nodes
echo "Fetching live nodes..."
NODES=$(curl -s --max-time 30 "$SOLR/solr/admin/collections?action=CLUSTERSTATUS&wt=json" \
  | jq -r --arg pattern "$NODE_PATTERN" '[.cluster.live_nodes[] | select(test($pattern))] | join(",")')

if [ -z "$NODES" ]; then
  echo "ERROR: Could not retrieve live nodes from $SOLR"
  exit 1
fi

NODE_COUNT=$(echo "$NODES" | tr ',' '\n' | wc -l | tr -d ' ')
echo "Found $NODE_COUNT live nodes"
echo ""

# Step 2: Fetch metrics from all nodes in one request
echo "Fetching INDEX metrics from all nodes..."
METRICS_JSON=$(curl -s --max-time 60 \
  "$SOLR/solr/admin/metrics?group=core&prefix=SEARCHER.searcher.numDocs,INDEX.sizeInBytes&nodes=$NODES&wt=json")

if [ -z "$METRICS_JSON" ] || echo "$METRICS_JSON" | jq -e '.error' > /dev/null 2>&1; then
  echo "ERROR: Failed to fetch metrics"
  echo "$METRICS_JSON" | jq '.error' 2>/dev/null
  exit 1
fi

# Step 3: Aggregate
# - totalDocs: deduplicate by shard (one replica per shard) — replicas hold identical docs
# - totalDiskSizeGB: sum ALL replicas — each replica occupies real disk space
# Response is keyed by node name: { "node:8983_solr": { "metrics": { "solr.core.X": {...} } } }
# Core name format: solr.core.<collection>.<shard>.<replica>  (dot-separated)
echo "$METRICS_JSON" | jq --argjson minSize "$MIN_SIZE_GB" '
  def collType:
    if test("solr9_history") then "history"
    elif test("solr9_hot_public") then "hot_public"
    elif test("solr9_image") then "image"
    elif test("solr9_lys") then "lys"
    elif test("solr9_pub_dashboard") then "pub_dashboard"
    elif test("solr9_uc_private_image") then "uc_private_image"
    elif test("solr9_uc_private") then "uc_private"
    elif test("solr9_uc_public") then "uc_public"
    elif test("solr9_annotation") then "annotation"
    else "other"
    end;

  [ del(.responseHeader) | to_entries[] |
    .key as $nodeKey |
    ($nodeKey | split(":")[0]) as $node |
    .value.metrics | to_entries[] |
    {
      shard: (.key | ltrimstr("solr.core.") | gsub("\\.[^\\.]+$"; "")),
      node: $node,
      numDocs: (.value["SEARCHER.searcher.numDocs"] // 0),
      sizeInBytes: (.value["INDEX.sizeInBytes"] // 0)
    }
  ] as $all |

  # Deduplicate by shard for docs, take max size across replicas for shard size
  ($all | group_by(.shard) | map({
    shard: .[0].shard,
    collType: (.[0].shard | collType),
    nodes: [.[].node] | unique,
    numDocs: .[0].numDocs,
    shardSizeGB: ([.[].sizeInBytes] | max / 1073741824 * 100 | round / 100),
    totalReplicaSizeGB: ([.[].sizeInBytes] | add / 1073741824 * 100 | round / 100),
    replicaCount: length
  })) as $deduped |

  $deduped | sort_by(.collType, .shard)
    | [.[] | select(.shardSizeGB >= $minSize)]
    | map({
      shard,
      collectionType: .collType,
      nodes,
      numDocs,
      shardSizeGB,
      replicaCount,
      totalReplicaSizeGB
    })
'

Then run the capacity planning script to check for overly large shards (typically 60 GB+):

./solr-capacity-planning.sh --over-size 60

If oversized shards are identified, coordinate with AlphaSense Support to manually split them.

4) Cluster imbalance (some machines over-utilized)

If the disk check from step 1 indicates a pod is using significantly more disk than others, re-balancing is needed. Re-balance by moving collections from over-utilized machines to less-loaded ones. Coordinate with AlphaSense Support for guidance.

3. Public Search Is Slow (Mothership / SaaS)

Triage:

Search against public content is slow while private content search performs normally. This indicates the issue is in the network path or systems between the private cloud environment and the AlphaSense mothership, and requires escalation to AlphaSense Support.

Troubleshooting:

Collect the following information before escalating to AlphaSense Support:

Timestamp range of the reported slowness (start and end time with timezone).
Sample ReqIds from slow requests. Every search request originating from the private cloud router includes a ReqId that is propagated through all downstream calls. Request a list of ReqIds from affected users, or extract them from the private cloud router logs for the relevant time window. To locate queries taking 10 seconds or longer (QTime is reported in milliseconds, so the pattern matches values with five or more digits):

{app="se-solrcloud-router-solr9"} |~ `QTime=[0-9]{5}`

Example queries that experienced slowness, including the approximate response time observed.
Whether the issue is consistent or intermittent, and whether it affects all users or specific users.
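
If the router logs for the time window have been exported to a file, the same QTime filter can be applied offline to rank the slowest requests. This is a sketch that assumes only that each relevant log line contains a QTime=<milliseconds> token, as the log query above implies. The rest of the line format is not assumed, and slow_requests is a hypothetical helper name.

```shell
#!/bin/sh
# Print log lines whose QTime is at or above a threshold (ms), slowest first.
# Reads exported router log lines on stdin.
slow_requests() {
  awk -v t="${1:-10000}" '
    match($0, /QTime=[0-9]+/) {
      # Skip the 6 characters of "QTime=" to isolate the numeric value.
      q = substr($0, RSTART + 6, RLENGTH - 6) + 0
      if (q >= t) print q "\t" $0
    }' | sort -rn | cut -f2-
}
```

For example, slow_requests 10000 < router.log prints the matching lines in descending QTime order, from which the ReqIds can be copied into the escalation.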

Provide the above to AlphaSense Support (@Shane Conroy, @Saji Mathew). The ReqId is traceable on the AlphaSense side and will be the most effective way to identify the specific requests and isolate the failing hop in the public search path.

Note: All private cloud search requests against public content are routed to the cold Solr cluster on the AlphaSense side. A single private cloud user search fans out to multiple shard-level requests, and the slowest shard governs the total observed search time. P99 latency is therefore the most meaningful metric to report.
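
To report a P99 figure from a set of observed response times, a nearest-rank percentile can be computed from a plain list of millisecond values. A minimal sketch, assuming one numeric value per line on stdin and an integer percentile; the percentile helper is hypothetical, and nearest-rank is one of several common percentile definitions.

```shell
#!/bin/sh
# Print the p-th percentile (nearest-rank method) of numbers read from stdin,
# one value per line. p is an integer between 1 and 100.
percentile() {
  sort -n | awk -v p="${1:-99}" '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      rank = int((p * NR + 99) / 100)   # ceil(p * NR / 100) for integer p
      if (rank < 1) rank = 1
      print v[rank]
    }'
}
```

With only a handful of samples the P99 equals the slowest observation, which matches the point that the slowest shard-level request determines what the user sees.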
