The Citegeist™ Relevancy Engine

Citegeist is the name of the relevancy engine that ranks search results on CourtListener. It’s the product of years of enhancements and a deep background in legal research.

Citegeist works by combining classic search relevancy algorithms with state-of-the-art legal ranking technology. At a high level, Citegeist has two ranking algorithms: keyword search and semantic search.

Below, we provide a high-level summary of the ways we rank search results. This information is provided to help legal researchers understand how the CourtListener system works.

Keyword search matches the words and connectors in your query against the content, then uses a variety of techniques to rank the results so that the best ones are at the top.

When you do a keyword search, the following technologies are used to rank the results:

  • Stemming — When a query is made, the first step is to find the root of each word so that the query can match all relevant documents. This step broadens each query so that words like runner, running, and run are all treated as equal.
  • Synonyms — Citegeist has a list of nearly 1,000 synonyms so that queries containing words like IRS will match documents containing Internal Revenue Service. This list was created by analyzing the top acronyms used in legal documents.
  • BM25 — This algorithm forms the heart of keyword ranking. In brief, it uses the frequency of words and the length of documents to identify the most important results for a query.
  • Field Boosting — Matches on some fields are more meaningful than others. For example, a match on the name of a case is more important than a match in the body of a case. We boost accordingly.
  • Phrase Boosting — When a user makes a query with multiple words, we boost if those words are found near each other in a result, indicating that a phrase in your query matched a phrase in a result.
  • Relevance Decay — As content gets older, it gets less relevant. A relevance decay curve is applied to our search results so that older content is demoted.
  • Jurisdiction Boosting — The higher the court, the more important the result. Lower courts get demoted by Citegeist.
  • Citation Boosting (coming soon) — We analyze the network of citations between cases in order to boost the cases that are most important.
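
To make the BM25 step above more concrete, here is a minimal sketch of the standard Okapi BM25 formula in Python. It is an illustration only, not Citegeist's actual code: the function name, the toy documents, and the parameter values (k1 = 1.2, b = 0.75, which are common defaults) are all assumptions, and stemming and synonyms are left out.

```python
import math
from collections import Counter

def bm25_scores(query_terms, documents, k1=1.2, b=0.75):
    """Score each tokenized document against the query with Okapi BM25.

    k1 and b are common default values, assumed here for illustration.
    """
    n_docs = len(documents)
    avg_len = sum(len(doc) for doc in documents) / n_docs

    # Document frequency: how many documents contain each term.
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))

    scores = []
    for doc in documents:
        term_counts = Counter(doc)
        score = 0.0
        for term in query_terms:
            tf = term_counts[term]
            if tf == 0:
                continue
            # Terms that appear in few documents get a higher weight.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Length normalization: longer documents need more matches
            # to reach the same score as shorter ones.
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores

# Toy corpus, invented for this example.
docs = [
    "the court affirmed the judgment".split(),
    "the tax court reviewed the irs assessment of the tax".split(),
    "oral argument was held".split(),
]
scores = bm25_scores(["tax", "court"], docs)
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranking)
```

The second document ranks first because it matches both query terms and contains the rarer term twice; the third document matches nothing and scores zero.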

Not all of these technologies are used in all of our search engines. The following table shows where these technologies are used:

[Table: rows list the features BM25, Synonyms, Field Boosting, Phrase Boosting, Relevance Decay, Jurisdiction Boosting, and Citation Boosting (Coming Soon); columns list the search engines Case Law, RECAP Archive, Oral Arguments, and Judges.]

Keyword search aims to provide intuitive results that clearly match the words, filters, and connectors you queried, but it has no notion of the underlying intent or meaning of your query. This can be limiting.

Semantic search, also known as AI search or vector search, is a modern approach to querying large data sets. Instead of matching particular keywords, it captures the underlying meaning of your query and returns results with similar meanings.
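
One common way to compare meanings is to represent each text as a vector (an embedding) and rank documents by cosine similarity to the query vector. The sketch below assumes that approach. The three-dimensional vectors are hand-written stand-ins for what a language model would produce, and the document titles are invented for the example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings; a real model would output hundreds of dimensions.
doc_vectors = {
    "landlord eviction dispute": [0.9, 0.1, 0.0],
    "tenant removed from apartment": [0.8, 0.3, 0.1],
    "patent infringement claim": [0.0, 0.2, 0.9],
}

# Stand-in for the embedding of a plain-English query about losing a rental.
query_vector = [0.85, 0.2, 0.05]

ranked = sorted(
    doc_vectors,
    key=lambda title: cosine_similarity(query_vector, doc_vectors[title]),
    reverse=True,
)
print(ranked)
```

The two housing documents land ahead of the patent document even though the query shares no keywords with them, which is the behavior keyword matching alone cannot provide.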

Semantic search is currently available for case law via our API, and we will be bringing it to our website soon.

Semantic search can provide a number of advantages over keyword search:

  • Ranking can be better — Because semantic search understands the underlying meaning of your query, it often provides better results than keyword search, particularly for users who simply type in their problem in plain English, without using advanced query operators.

  • Long queries are just as fast — Keyword search engines slow down as you add words to your query. This limits how complex queries can be. Semantic search does not have this problem. Long queries are as fast as short ones, allowing you to provide more information and context to Citegeist.

  • Synonyms are automatic — In a keyword search engine, an administrator must create a list of synonyms for the system to use. Semantic search engines are able to automatically broaden your search to match relevant synonyms.

  • Hybrid search with both semantics and keywords — In addition to pure semantic search, you can enclose specific keywords in quotation marks to invoke hybrid search. That will retrieve both semantically relevant results and results with high BM25 scores for the enclosed keywords.
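
The hybrid behavior described in the last item can be pictured with a short sketch. The text does not say how Citegeist parses quoted keywords or merges the two result sets, so both helper functions below, and the interleaving strategy in particular, are assumptions made for illustration.

```python
import re
from itertools import zip_longest

QUOTED = re.compile(r'"([^"]+)"')

def split_hybrid_query(query):
    """Return (semantic_text, quoted_keywords) for a query string.

    Hypothetical helper: quoted spans become keyword constraints, and the
    whole query (quote marks removed) is kept for the semantic side.
    """
    keywords = QUOTED.findall(query)
    semantic_text = " ".join(query.replace('"', " ").split())
    return semantic_text, keywords

def merge_results(semantic_hits, keyword_hits):
    """Interleave two ranked lists of document IDs, dropping duplicates.

    Interleaving is one simple merge strategy, assumed here.
    """
    merged = []
    for pair in zip_longest(semantic_hits, keyword_hits):
        for doc_id in pair:
            if doc_id is not None and doc_id not in merged:
                merged.append(doc_id)
    return merged

text, keywords = split_hybrid_query(
    'my landlord kept my deposit "security deposit" statute'
)
print(text)
print(keywords)
print(merge_results(["a", "b", "c"], ["c", "d"]))
```

In this sketch the quoted phrase would be sent to the keyword engine while the full sentence goes to the semantic engine, and the merged list contains results from both.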

There are also some reasons to choose keyword search:

  • It is predictable — Semantic search engines are powerful, but they can be hard to understand, and sometimes it’s unclear why certain results are returned. Keyword search returns only the results that match.

  • It is complete — Many legal documents are only a few words long (e.g., SCOTUS cases that simply say, “AFFIRMED” or “CERT DENIED”). Such decisions carry little meaning except in the context of other results and are not well suited to semantic search, so we do not add them to the semantic search engine. Keyword search still covers them.

  • You can go deep — Semantic search engines only provide the top results. Keyword search engines allow you to deeply research a particular topic.

Semantic search uses a language model and an approximate nearest neighbor algorithm to measure semantic similarity between documents and queries. The quality of the model that we use determines how well the system works. To provide the best results possible, we created a domain-adapted fine-tuned model, which we released to the public for free.
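
Approximate nearest neighbor search trades a little accuracy for speed by comparing the query against only a subset of the stored vectors. The text does not say which algorithm Citegeist uses, so the sketch below swaps in one well-known technique, random-hyperplane hashing, purely as an illustration. The class name, the toy vectors, and all parameters are hypothetical.

```python
import math
import random
from collections import defaultdict

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

class HyperplaneIndex:
    """Toy approximate nearest neighbor index (random-hyperplane hashing)."""

    def __init__(self, dim, n_planes=4, seed=0):
        rng = random.Random(seed)
        self.planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_planes)]
        self.buckets = defaultdict(list)

    def _signature(self, vector):
        # One bit per hyperplane: which side of the plane the vector falls on.
        # Vectors pointing in similar directions tend to share a signature.
        return tuple(
            sum(p * v for p, v in zip(plane, vector)) >= 0 for plane in self.planes
        )

    def add(self, doc_id, vector):
        self.buckets[self._signature(vector)].append((doc_id, vector))

    def search(self, query, k=3):
        # Only the query's own bucket is scanned, so a true neighbor that
        # hashed into a different bucket can be missed. That is the
        # "approximate" part.
        candidates = self.buckets.get(self._signature(query), [])
        ranked = sorted(candidates, key=lambda item: cosine(query, item[1]), reverse=True)
        return [doc_id for doc_id, _ in ranked[:k]]

index = HyperplaneIndex(dim=3)
index.add("eviction", [0.9, 0.1, 0.0])
index.add("patent", [0.0, 0.2, 0.9])
index.add("tenancy", [0.8, 0.3, 0.1])

hits = index.search([0.9, 0.1, 0.0], k=2)
print(hits)
```

A query identical to a stored vector always lands in that vector's bucket, so "eviction" is returned first here; whether "tenancy" also appears depends on the random planes.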

Citegeist uses this model in conjunction with BM25, relevance decay, and jurisdiction boosting to rank results (see above).
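
As a rough picture of how such signals could be combined, the sketch below multiplies a base relevance score by an age-based decay factor and a court-level boost. The exponential curve, the 20-year half-life, the multiplier values, and the multiplicative combination are all assumptions; the text does not give the curve or weights Citegeist actually applies.

```python
# Hypothetical court-level multipliers, invented for this example.
COURT_BOOST = {"supreme": 1.5, "appellate": 1.2, "trial": 1.0}

def decay_factor(age_years, half_life_years=20.0):
    """Exponential decay: the factor halves every half_life_years (assumed curve)."""
    return 0.5 ** (age_years / half_life_years)

def final_score(base_score, age_years, court_level):
    """Combine a base relevance score with decay and a jurisdiction boost."""
    return base_score * decay_factor(age_years) * COURT_BOOST[court_level]

# A new trial-court opinion versus a 20-year-old high-court opinion
# with the same base score.
recent_trial = final_score(2.0, age_years=0, court_level="trial")
older_supreme = final_score(2.0, age_years=20, court_level="supreme")
print(recent_trial, older_supreme)
```

With these made-up numbers the recent trial-court opinion outranks the older high-court opinion; different weights would tip the balance the other way, which is why the tuning of such factors matters.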
