23.01.26

OCR & NER

Author: Ari Ben Am



Most of the hype about AI over the past two or so years has focused on one specific type of artificial intelligence: large language models, or LLMs. That is not the only type of AI used in investigations or intelligence collection, though. Believe it or not, there was a “before time” in which analysts worked without ChatGPT and other LLMs.
Back in the before time, prior to 2022 or so, analysts often had to work manually on many types of investigations, combing over data points themselves. This wasn’t easy, to say the least, nor was it even feasible in the many cases where analysts had to work in languages they didn’t speak. Back then, analysts had one main tool available to help them: OCR, or optical character recognition. OCR is the technology that enables computers to identify text in an image or file and turn it into textual data that we can work with, for example by copying and pasting or searching. Essentially, it makes the text “machine-readable”.

The Importance of OCR in Data Analysis

At its core, data analysis depends on one fundamental requirement: data must be machine-readable. Vast amounts of information relevant to investigations exist in formats that are inherently hostile to analysis—scanned PDFs, photographs, screenshots, handwritten notes, faxes, and legacy documents. OCR acts as the gateway that converts these otherwise static artifacts into usable data.
Importantly, OCR enables downstream analytical techniques. Once text is extracted, it can be normalized, cleaned, and structured, allowing it to be fed into databases, visualization tools, statistical models, and—more recently—LLMs. In this sense, OCR is not a convenience feature; it is foundational infrastructure for modern data analysis.
Without OCR, these sources remain opaque. With OCR, they become searchable, indexable, and analyzable at scale. Analysts can run keyword searches across thousands or millions of documents, identify recurring terms, and quickly narrow down large datasets to the small subset that actually matters. This transformation is often the difference between actionable insight and total data overload.
Think of this in the context of our day-to-day lives. Want to search through a very long PDF? OCR is the way to do it. Want to scan a picture of a sign in a foreign country and translate what’s written? OCR again.
These same implications apply, on a larger scale, to investigations, intelligence collection, and national security work. Need to read a document in a foreign language? No problem: OCR can extract the text so you can translate it. Need to scan thousands of online posts with pictures and videos to find a specific mention or piece of text? With OCR, you can. This may seem niche, but think of the utility: you can search for anything from license plates in pictures to text on shirts or jackets to specific signs, and whatever else may come up. The utility in investigations at this level is almost endless.
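As a rough illustration of the "normalize, then search" step described in this section, here is a minimal Python sketch. It assumes the OCR engine has already produced raw text for each scan. The function names (normalize_ocr_text, build_index, search) and the cleanup rules are hypothetical choices for this example, not a description of any particular product.

```python
import re
import unicodedata
from collections import defaultdict


def normalize_ocr_text(raw: str) -> str:
    """Clean a few common OCR artifacts so the text can be searched reliably."""
    # Fold compatibility characters, e.g. the single-glyph ligature "ffi" into the letters f, f, i.
    text = unicodedata.normalize("NFKC", raw)
    # Rejoin words split by a hyphen at a line break. Simplification: this also
    # merges genuinely hyphenated words that happen to break at that point.
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
    # Collapse line breaks and repeated spaces into single spaces.
    text = re.sub(r"\s+", " ", text)
    return text.strip().lower()


def build_index(documents: dict) -> dict:
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, raw in documents.items():
        for word in re.findall(r"\w+", normalize_ocr_text(raw)):
            index[word].add(doc_id)
    return index


def search(index: dict, query: str) -> set:
    """Return the ids of documents that contain every word in the query."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return set()
    return set.intersection(*(index.get(word, set()) for word in words))
```

Once an index like this exists, narrowing thousands of scans down to the few that mention a given term is a set lookup rather than a manual read-through. A production system would typically use a full-text search engine instead of an in-memory dictionary.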

Entity Extraction and Automation

One of OCR’s most powerful contributions is enabling entity extraction. Once text is digitized, automated systems can identify people, organizations, locations, and other key elements embedded within documents. These entities can then be mapped, cross-referenced, and analyzed as part of a broader investigative graph. That is, assuming you have the right tool to do so, like Falkor.
This automation unlocks scale. Instead of analysts individually reading documents to “notice” important details, systems can flag relevant entities automatically and surface hidden relationships. Entire workflows, from document ingestion to triage and prioritization, can be automated once OCR has done the initial work of converting images into text. Think of the value here: you could upload a PDF of an arrest or incident report, a long strategy document, a cyber threat intelligence report, or any other file or image, and have it automatically modeled, configured, and ready for further investigation in your system without lifting a finger.
Crucially, this process predates LLMs and continues to operate alongside them, many of which can now perform their own form of OCR. Even today, advanced language models rely on OCR output as their raw material when dealing with scanned or image-based data.

Beyond Investigations

The impact of OCR extends well beyond traditional investigations. In compliance, it enables organizations to monitor large volumes of documentation for regulatory risk. In intelligence collection, it allows analysts to process open-source material, leaked archives, and foreign-language documents at scale. In journalism and research, it unlocks historical records that would otherwise remain inaccessible.
Across all of these domains, the pattern is the same: OCR transforms information that humans can see but computers cannot understand into data that machines can analyze and augment.

OCR in the Age of LLMs

As LLMs become more prominent, it can be tempting to view them as a replacement for earlier AI technologies. In reality, they are additive. LLMs do not eliminate the need for OCR—they depend on it. A language model cannot reason over text that has never been extracted.
In many ways, OCR is the silent enabler of today’s AI-driven workflows. It sits upstream of summarization, translation, classification, and synthesis. While LLMs may be the most visible layer of modern AI, OCR remains one of its most essential—and enduring—components.
Understanding this broader AI stack is critical. The future of investigations and intelligence is not about a single model or technique, but about integrating multiple AI tools—each solving a different part of the problem—to turn raw information into insight.

Named Entity Recognition (NER): Turning Text Into Structure

Once OCR has done the critical work of converting images and documents into machine-readable text, the next challenge is understanding what that text actually contains. This is where Named Entity Recognition (NER) becomes essential.
NER is an AI technique that automatically identifies and classifies key elements within text—such as people, organizations, locations, dates, phone numbers, bank accounts, vessel names, or product identifiers. In investigative and intelligence contexts, these “entities” are often the most valuable pieces of information embedded in a document.
Without NER, extracted text remains largely unstructured. Analysts may be able to search it, but they still have to mentally parse what matters and how different pieces relate to each other. NER turns free-form text into structured data that systems can reason over.
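To make "turning free-form text into structured data" concrete, here is a minimal Python sketch. Note the swap: it uses plain regular expressions as a stand-in for a real NER model, so it only handles pattern-like entity types (dates, email addresses, phone numbers, IBAN-style account numbers). Recognizing people, organizations, or locations generally requires a trained model. The patterns, the Entity class, and extract_entities are illustrative assumptions.

```python
import re
from dataclasses import dataclass

# Illustrative patterns only. They are deliberately narrow and will miss many
# real-world formats; a trained NER model would replace this table.
PATTERNS = {
    "DATE": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "IBAN": re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b"),
    "PHONE": re.compile(r"\+\d{1,3}[\s-]?\d{2,4}(?:[\s-]?\d{2,4}){1,3}"),
}


@dataclass(frozen=True)
class Entity:
    label: str   # entity type, e.g. "DATE"
    text: str    # the matched span as it appears in the document
    start: int   # character offset where the span begins
    end: int     # character offset just past the span


def extract_entities(text: str) -> list:
    """Return every pattern match as an Entity, ordered by position in the text."""
    entities = [
        Entity(label, match.group(), match.start(), match.end())
        for label, pattern in PATTERNS.items()
        for match in pattern.finditer(text)
    ]
    return sorted(entities, key=lambda entity: entity.start)
```

Keeping the character offsets alongside each entity means every structured record can be traced back to the exact place in the source text it came from.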

Why NER Matters in Investigations

Investigations are rarely about isolated documents; they are about connections. NER allows analysts to move beyond reading individual files and instead analyze relationships across entire datasets.
For example, NER makes it possible to:
  • Automatically identify all people and organizations mentioned across thousands of documents
  • Detect repeated references to the same entity, even across different formats or sources
  • Surface entities that appear unusually often or in unexpected contexts
  • Link documents together based on shared entities rather than keywords alone
This dramatically changes how investigations scale. Instead of asking “What does this document say?”, analysts can ask higher-level questions such as “Which individuals connect these cases?” or “What organizations appear across unrelated datasets?”
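The last point in the list above, linking documents through shared entities, can be sketched in a few lines of Python. This assumes entity extraction has already produced a set of entity names per document. link_documents is a hypothetical helper, and its case-folding is a deliberately crude form of entity matching.

```python
from collections import defaultdict
from itertools import combinations


def link_documents(doc_entities: dict) -> dict:
    """For each pair of documents, collect the entities that both mention.

    doc_entities maps a document id to the set of entity names found in it.
    """
    docs_by_entity = defaultdict(set)
    for doc_id, entities in doc_entities.items():
        for entity in entities:
            # Crude normalization: ignore case and surrounding whitespace.
            docs_by_entity[entity.casefold().strip()].add(doc_id)

    links = defaultdict(set)
    for entity, doc_ids in docs_by_entity.items():
        # Every pair of documents sharing this entity gets a link labeled with it.
        for first, second in combinations(sorted(doc_ids), 2):
            links[(first, second)].add(entity)
    return dict(links)
```

The result is effectively an edge list for a graph: documents are nodes, and each edge records which entities connect them. Real systems need stronger entity resolution (aliases, transliterations, misspellings) than the case-folding used here.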

Entity-Centric Analysis

One of the most powerful shifts enabled by NER is the move from document-centric to entity-centric analysis. Rather than treating documents as the primary unit of investigation, entities themselves become the focal point.
An entity-centric view allows analysts to build profiles around people, companies, or locations, aggregating all relevant mentions from across the data. This helps reveal hidden patterns, such as:
  • Shell companies linked by shared directors or addresses
  • Networks of individuals operating across jurisdictions
  • Reused contact information suggesting coordinated activity
These insights are difficult—if not impossible—to uncover through manual review alone.
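A minimal Python sketch of the entity-centric view follows. It assumes upstream extraction yields (entity, attribute, value, source document) records. The record layout and the function names build_profiles and shared_attribute_clusters are hypothetical, chosen only to illustrate the idea.

```python
def build_profiles(mentions: list) -> dict:
    """Aggregate (entity, attribute, value, source) records into one profile per entity.

    Result shape: {entity: {attribute: {value: {source documents}}}}
    """
    profiles = {}
    for entity, attribute, value, source in mentions:
        values = profiles.setdefault(entity, {}).setdefault(attribute, {})
        # Keep the source so every fact in the profile stays traceable.
        values.setdefault(value, set()).add(source)
    return profiles


def shared_attribute_clusters(profiles: dict, attribute: str) -> dict:
    """Find values of one attribute (e.g. an address) used by two or more entities."""
    entities_by_value = {}
    for entity, attributes in profiles.items():
        for value in attributes.get(attribute, {}):
            entities_by_value.setdefault(value, set()).add(entity)
    return {
        value: entities
        for value, entities in entities_by_value.items()
        if len(entities) > 1
    }
```

Calling shared_attribute_clusters(profiles, "address") would surface companies registered at the same address, and the same call with "director" would surface shared directors. These are the kinds of overlaps the list above describes.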

NER as an Enabler for Automation and Discovery

NER also plays a key role in automating investigative workflows. Once entities are identified, systems can:
  • Trigger alerts when high-risk entities appear
  • Cross-reference entities against sanctions, watchlists, or internal databases
  • Prioritize documents based on the presence of specific entity types
  • Feed structured entities into graphs, timelines, and analytical models

This automation reduces noise and allows human analysts to focus their time on interpretation and judgment rather than data extraction.
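The alerting, cross-referencing, and prioritization steps above might look roughly like the following Python sketch. The scoring weights, the entity labels, and the triage function are all assumptions made up for this example. Exact name matching is also a simplification, since real watchlist screening typically relies on fuzzy matching and alias handling.

```python
# Hypothetical weights: how much each entity type contributes to a document's
# priority, and how much a watchlist hit adds on top.
TYPE_WEIGHTS = {"BANK_ACCOUNT": 3, "ORG": 2, "PERSON": 1}
WATCHLIST_HIT_WEIGHT = 10


def normalize(name: str) -> str:
    """Lowercase the name and collapse internal whitespace for exact comparison."""
    return " ".join(name.casefold().split())


def triage(doc_entities: dict, watchlist: list) -> list:
    """Score documents by their entities and flag watchlist hits.

    doc_entities maps a document id to a list of (label, text) entities.
    Returns (doc_id, score, alerts) tuples, highest score first.
    """
    watched = {normalize(name) for name in watchlist}
    results = []
    for doc_id, entities in doc_entities.items():
        # Alerts: distinct entity texts that match a watchlist entry.
        alerts = sorted({text for _label, text in entities if normalize(text) in watched})
        score = sum(TYPE_WEIGHTS.get(label, 0) for label, _text in entities)
        score += WATCHLIST_HIT_WEIGHT * len(alerts)
        results.append((doc_id, score, alerts))
    return sorted(results, key=lambda result: result[1], reverse=True)
```

The output is a ranked review queue: documents with watchlist hits rise to the top, and the alerts list tells the analyst why.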

NER and the Broader AI Stack

Like OCR, NER is not a replacement for LLMs—it is a foundational component that complements them. LLMs may help summarize findings or explain relationships, but NER provides the structured backbone that makes those explanations reliable and traceable.
In practice, OCR extracts the text, NER identifies what matters within that text, and higher-level AI systems build meaning on top. Each layer depends on the one before it.
As AI continues to evolve, NER remains a critical bridge between raw text and actionable intelligence—quietly transforming language into data, and data into insight.
This is where Falkor.ai comes in. Falkor brings OCR, Named Entity Recognition, and advanced AI-driven analysis together into a single, unified system purpose-built for investigations and intelligence work. Instead of stitching together disconnected tools, analysts can ingest raw, messy data—scanned documents, images, multilingual files, and unstructured text—and have it transformed end-to-end into searchable, structured, and analyzable intelligence. OCR unlocks the data, NER surfaces the entities that matter, and automated analysis reveals connections and hidden patterns at scale. The result is faster investigations, deeper insight, and more confident decisions—all in one platform designed to work the way real investigations actually happen.
