ExpertSearch: Finding Domain Experts with Topic Modeling

ExpertSearch is a search engine that, given a paper abstract or a list of topics, returns a ranked list of faculty experts in that domain. Built as part of the Data Sciences Summer Institute (DSSI) at the University of Illinois at Urbana-Champaign, the system indexed 4,445 UIUC faculty members across 247 programs of study.

I was part of the Topic Modeling team, responsible for discovering the latent research interests of each expert from their crawled text (papers, homepages, and course pages), using Latent Dirichlet Allocation (LDA) with Gibbs Sampling.

Scale of the Corpus

  • 13,609 documents crawled
  • 4,445 faculty experts
  • 987 MB of text collected
  • 247 programs of study

System Pipeline

The system was built by five independent teams that handed off data through each stage of the pipeline. My work sat in the middle, transforming raw text into a probabilistic topic representation before passing it to the Information Retrieval engine.

1. Data Crawling
   Seeded from the UIUC phonebook; crawled up to 300 HTML pages per expert plus PDF publications. Stored homepage text, document counts, and links for all 4,445 faculty members.

2. Classification & Extraction
   Supervised learning (Sparse Network Learner with bag-of-bigrams features) achieved 87.3% accuracy in classifying homepages. RAKE and the Illinois Chunker extracted keyword noun phrases from papers and homepages to build each expert's "bag of words."

3. Topic Modeling: My Contribution
   Applied LDA with Gibbs Sampling to the extracted text to produce a probability distribution of topics per expert and of words per topic. This compact representation replaced raw keyword bags for retrieval.

4. Information Retrieval
   Given a user query, the IR component ranked experts using both Language Model (LM) and Topic Model (TM) scoring. TM scoring performed better on longer queries; LM performed better on shorter ones.

5. User Interface
   A web app accepting paper abstracts, keywords, or a department as input. Results included ranked expert lists, per-expert word clouds, and a geographical map showing top-ranked experts by location.
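The LM scoring used in the retrieval stage can be sketched as a standard unigram query-likelihood model with Dirichlet-prior smoothing. This is an illustration of the general technique, not the project's actual scoring code; the function name, parameter `mu`, and data shapes are assumptions.

```python
import math
from collections import Counter

def lm_score(query_terms, expert_tokens, collection_counts, collection_len, mu=2000.0):
    """Query log-likelihood under a unigram language model with
    Dirichlet-prior smoothing (a standard LM scoring formulation;
    the project's exact scoring code is not reproduced here)."""
    doc_counts = Counter(expert_tokens)
    doc_len = len(expert_tokens)
    score = 0.0
    for term in query_terms:
        # Background probability from the whole collection.
        p_background = collection_counts.get(term, 0) / collection_len
        # Smoothed per-expert probability: document counts shrunk
        # toward the collection statistics by pseudo-count mu.
        p_term = (doc_counts[term] + mu * p_background) / (doc_len + mu)
        if p_term > 0:  # terms unseen anywhere contribute nothing
            score += math.log(p_term)
    return score
```

Scores are log-probabilities, so experts are ranked by descending score; `mu` controls how strongly sparse documents are pulled toward collection statistics, one plausible reason LM and TM scoring diverge on queries of different lengths.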

Topic Modeling: My Work

Why Topic Modeling?

Faculty members have multiple, overlapping areas of expertise. A raw keyword index treats all words as independent and struggles to surface semantic similarity; "machine learning" and "neural networks" look unrelated to a term-based system. Topic modeling solves this by discovering latent topics: shared vocabularies that group semantically related words together without any manual labeling.

LDA also dramatically reduces dimensionality. Instead of matching a query against tens of thousands of unique words per expert, the IR engine compares compact topic probability vectors, making retrieval both faster and more semantically meaningful.
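As a sketch of that comparison, a query's inferred topic vector can be matched against each expert's topic vector with cosine similarity. The five-topic vectors and expert names below are hypothetical, and the production system may have used a different similarity measure.

```python
import math

def cosine(u, v):
    """Cosine similarity between two topic-probability vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical 5-topic vectors: the query abstract leans toward topic 2,
# as does expert A; expert B leans toward topic 4.
query   = [0.05, 0.10, 0.70, 0.10, 0.05]
experts = {"A": [0.10, 0.05, 0.60, 0.15, 0.10],
           "B": [0.05, 0.05, 0.10, 0.70, 0.10]}

# Rank experts by similarity to the query's topic mixture.
ranked = sorted(experts, key=lambda name: cosine(query, experts[name]),
                reverse=True)
```

With, say, 200 topics this comparison touches 200 floats per expert instead of a vocabulary-sized sparse vector, which is the dimensionality win described above.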

Latent Dirichlet Allocation

LDA is a generative probabilistic model that assumes each document is a mixture of topics and each topic is a distribution over words. Given the corpus of expert text files, the model inferred:

  • A distribution of topics over each expert (expert → topic probabilities)
  • A distribution of words over each topic (topic → word probabilities)

Gibbs Sampling served as the inference algorithm: on each pass it resampled every word's topic assignment from its conditional distribution given all other current assignments, converging to a stable topic structure across the corpus.
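The procedure described above can be sketched as a minimal collapsed Gibbs sampler. This is an illustration of the technique, not the DSSI implementation; hyperparameters `alpha` and `beta` and all names are assumptions.

```python
import random
from collections import defaultdict

def lda_gibbs(docs, n_topics, n_iters=200, alpha=0.1, beta=0.01, seed=0):
    """Minimal collapsed Gibbs sampler for LDA. `docs` is a list of
    token lists; returns per-document topic counts and per-topic word
    counts, which normalize into the two distributions listed above."""
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})               # vocabulary size
    ndk = [[0] * n_topics for _ in docs]                # doc -> topic counts
    nkw = [defaultdict(int) for _ in range(n_topics)]   # topic -> word counts
    nk = [0] * n_topics                                 # tokens per topic
    z = []                                              # topic of each token
    for d, doc in enumerate(docs):                      # random initialization
        zs = []
        for w in doc:
            k = rng.randrange(n_topics)
            ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
            zs.append(k)
        z.append(zs)
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # Remove this token's current assignment from the counts.
                ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                # Full conditional: p(k) ∝ (ndk+alpha) * (nkw+beta) / (nk+V*beta)
                weights = [(ndk[d][t] + alpha) * (nkw[t][w] + beta)
                           / (nk[t] + V * beta) for t in range(n_topics)]
                r = rng.random() * sum(weights)
                acc = 0.0
                for t, wt in enumerate(weights):
                    acc += wt
                    if r <= acc:
                        k = t
                        break
                z[d][i] = k                             # record new assignment
                ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    return ndk, nkw
```

Normalizing each row of `ndk` (with `alpha` smoothing) gives the expert → topic probabilities, and each `nkw` table (with `beta` smoothing) gives the topic → word probabilities.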

Sample Output

The table below shows a real example from the system output: topic 87, strongly associated with the words data, pattern, mine, and algorithm, accounted for 55% of expert 3301's topic distribution, correctly capturing a data mining expert.

Expert → Topic probabilities (Expert ID 3301):

  • Topic 87 (data mining) → 0.5510
  • Topic 173 → 0.1278
  • Topic 199 → 0.0655
  • Topic 176 → 0.0490

Topic 87 → Top word probabilities:

  • data → 0.1776
  • pattern → 0.1247
  • mine → 0.0142
  • algorithm → 0.0126

Classification Results

The upstream classification step (owned by another team, but feeding directly into our input) achieved the following performance on homepage detection:

  • Learning algorithm: Sparse Network Learner, features: bag-of-bigrams
  • Overall accuracy: 87.3%
  • Homepage precision: 85.0%  |  recall: 90.7%
  • Non-homepage precision: 90.0%  |  recall: 84.9%
  • 1,063 pages classified as faculty homepages out of the crawled corpus
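For illustration, precision and recall follow directly from confusion counts. The raw counts below are hypothetical (the project's actual confusion matrix is not given here), chosen only so the arithmetic is consistent with the figures reported above.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true-positive, false-positive,
    and false-negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

# Hypothetical homepage-class counts: 1,063 pages predicted as homepages
# (tp + fp), of which we suppose 903 were true homepages.
p, r = precision_recall(tp=903, fp=160, fn=93)
```

Under these assumed counts, precision comes out near 0.85 and recall near 0.91, matching the reported 85.0% / 90.7%.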

Use Cases

The system was designed to support a range of real-world applications beyond academic search:

  • Campus event planning: generate an invitation list for a colloquium from a talk abstract
  • DHS preparedness: surface domain experts during natural disasters or national emergencies
  • Emergency medical response: rapidly identify specialists for urgent clinical situations
  • Mobile and web deployment: the modular pipeline was designed for extension to native apps

Tech Stack

  • Topic Modeling: LDA with Gibbs Sampling (custom implementation)
  • NLP / Extraction: RAKE (Rapid Automatic Keyword Extraction), Illinois Chunker (NP detection)
  • Classification: Sparse Network Learner, bag-of-bigrams feature representation
  • Information Retrieval: Language Model (LM) and Topic Model (TM) ranking
  • Web crawling: Custom crawler seeded from UIUC phonebook, depth-limited to 300 pages per expert
  • UI: Web application with search interface, word cloud visualization, and geographic expert map

Reflection

ExpertSearch was my first exposure to probabilistic machine learning at scale. Working on topic modeling within a larger research team, with clearly defined interfaces between pipeline stages, introduced me to the discipline of building components that others depend on. The LDA output had to be correct in format, range, and semantics before the IR team could use it.

The project also shaped my view of what "search" means beyond keywords: matching intent through learned representations rather than exact term overlap. That idea has continued to inform my work in distributed systems and data engineering more than a decade later.
