
Knowledge Graph, Retrieval, and workforce-specific embeddings

How a graph database and proprietary embeddings turn workforce data into computable intelligence


Why flat data fails workforce intelligence

Most HR technology stores workforce data in relational tables. A person record sits in one table. A role sits in another. Skills live in a third. When someone asks a question that spans those boundaries, the system runs joins across tables, flattening a multidimensional reality into rows and columns. The result is slow, brittle, and structurally incapable of answering questions about indirect relationships.

Consider a straightforward query: which employees have skills adjacent to those required by an open role, have demonstrated those skills in cross-functional projects, and sit on teams with capacity to absorb a temporary transition? Answering that question in a relational database requires multiple joins, subqueries, and application-layer logic. In a knowledge graph, it is a single traversal.
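To make that single traversal concrete, here is a minimal in-memory sketch. The graph, skill names, project tags, and capacity numbers are all invented for illustration; a production system would express this as one graph query rather than a loop over dictionaries.

```python
# Toy in-memory graph; all names and numbers are invented for illustration.
ROLE_REQUIRES = {"open_product_role": {"roadmapping", "user_research"}}
ADJACENT = {  # skill -> skills treated as adjacent to it
    "program_planning": {"roadmapping"},
    "ux_writing": {"user_research"},
}
PERSON_SKILLS = {"ana": {"program_planning"}, "ben": {"sql"}}
PERSON_PROJECTS = {"ana": [("growth_experiment", "cross_functional")],
                   "ben": [("etl_cleanup", "team_internal")]}
PERSON_TEAM = {"ana": "platform", "ben": "data"}
TEAM_CAPACITY = {"platform": 0.8, "data": 0.2}  # fraction of slack capacity

def candidates(role, min_capacity=0.5):
    """One walk: role -> required skills -> adjacent skills -> people,
    filtered by cross-functional project evidence and team capacity."""
    required = ROLE_REQUIRES[role]
    matches = []
    for person, skills in PERSON_SKILLS.items():
        adjacent_hit = any(ADJACENT.get(s, set()) & required for s in skills)
        cross_functional = any(kind == "cross_functional"
                               for _, kind in PERSON_PROJECTS[person])
        has_capacity = TEAM_CAPACITY[PERSON_TEAM[person]] >= min_capacity
        if adjacent_hit and cross_functional and has_capacity:
            matches.append(person)
    return matches

print(candidates("open_product_role"))  # -> ['ana']
```

The relational equivalent would need joins across person, skill, project, and team tables plus application-layer filtering; here the whole question is one walk over adjacency maps.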

The shift from relational storage to graph-based intelligence is not an optimization. It is an architectural decision that determines what questions the system can answer and how quickly it can answer them.

The Neo4j graph: structure at scale

Gloat’s workforce intelligence layer is built on Neo4j, a native graph database purpose-built for storing and querying highly connected data. The production graph contains over 2.4 million nodes and 18.7 million connections, representing the full topology of organizational capability.

Entity types

The graph organizes workforce data into five primary entity types, each represented as a node class with its own properties and relationship patterns.

| Entity type | Description | Example properties | Typical connections |
| --- | --- | --- | --- |
| People | Individual employees, contractors, and contingent workers across the organization | Tenure, location, level, performance band, availability | HAS_SKILL, HOLDS_ROLE, MEMBER_OF, WORKED_ON |
| Roles | Positions defined by responsibility scope, not just job title | Function, family, level, location requirements, status | REQUIRES_SKILL, REPORTS_TO, BELONGS_TO_TEAM |
| Skills | Capabilities validated through experience, assessment, or inference | Proficiency level, source, recency, demand trend | RELATED_TO, PREREQUISITE_FOR, USED_IN |
| Projects | Time-bound work assignments including gigs, stretch assignments, and formal projects | Duration, status, required skills, team size | STAFFED_BY, REQUIRES_SKILL, OWNED_BY |
| Teams | Organizational units at any level of hierarchy | Size, function, location, capacity utilization | CONTAINS_ROLE, MANAGED_BY, COLLABORATES_WITH |

Each entity type connects to the others through typed, directional relationships. A person node connects to a skill node through HAS_SKILL. That same skill node connects to a role node through REQUIRED_BY, the inverse of the role's REQUIRES_SKILL relationship. The role node connects to a team node through BELONGS_TO_TEAM. These connections are not metadata. They are the primary data structure.
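The chain just described (person to skill to role to team) can be sketched as a sequence of typed-edge lookups. The edge list, node identifiers, and exact relationship names below are illustrative stand-ins, not the actual schema.

```python
from collections import defaultdict

# Illustrative typed, directional edges (invented identifiers).
EDGES = [
    ("person:ana", "HAS_SKILL", "skill:roadmapping"),
    ("skill:roadmapping", "REQUIRED_BY", "role:pm_lead"),
    ("role:pm_lead", "BELONGS_TO_TEAM", "team:platform"),
]
ADJ = defaultdict(list)
for src, rel, dst in EDGES:
    ADJ[src].append((rel, dst))

def traverse(start, path):
    """Follow a fixed chain of relationship types; each hop is a
    direct adjacency lookup on the current frontier of nodes."""
    frontier = {start}
    for rel in path:
        frontier = {dst for node in frontier
                    for r, dst in ADJ[node] if r == rel}
    return frontier

# Three hops: person -> skill -> role -> team
print(traverse("person:ana", ["HAS_SKILL", "REQUIRED_BY", "BELONGS_TO_TEAM"]))
# -> {'team:platform'}
```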

Multi-hop traversal

The power of a graph database surfaces when queries require following chains of relationships. A single-hop query asks: what skills does this person have? A two-hop query asks: what roles require the skills this person has? A three-hop query asks: which teams contain roles that require the skills this person has, and do those teams have capacity?

Each additional hop multiplies the intelligence available to the system. In production, Gloat’s agents routinely execute three-to-five-hop traversals to assemble the context needed for recommendations. A redeployment agent, for example, might traverse from a person node to their skill nodes, then to role nodes that match those skills, then to team nodes to check capacity, then to project nodes to identify transitional assignments that could bridge the gap. That entire traversal executes in milliseconds.

Graph databases achieve this speed because relationships are stored as direct pointers between nodes, not as foreign keys that require index lookups at query time. The cost of traversing a relationship is constant regardless of the total size of the graph. Adding another million nodes does not slow down a three-hop query for a single person.

Why generic embeddings fail on workforce data

Retrieval-augmented generation (RAG) depends on embedding models to convert text into vectors and then find semantically similar content. Most RAG implementations use general-purpose embedding models trained on web-scale text corpora. These models understand that “doctor” and “physician” are similar. They do not understand that “Senior Technical Program Manager” and “Staff Engineering Lead” share 70% of their competency requirements despite having zero words in common.

Workforce language is dense with organizational jargon, role-specific terminology, and context-dependent meaning. The word “platform” means something different in an engineering job description than in a sales enablement context. “Leadership” in a senior individual contributor role emphasizes technical influence; in a people management role, it emphasizes team development. Generic embeddings trained on Wikipedia and web crawls collapse these distinctions.
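A toy cosine-similarity comparison makes the distinction concrete. The three-dimensional vectors below are invented to mimic the behavior described: a hypothetical generic model keyed to surface words, and a hypothetical workforce model keyed to underlying competencies.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Hypothetical generic embeddings: driven by shared surface words,
# so these two titles (zero words in common) land far apart.
generic = {
    "Senior Technical Program Manager": [0.9, 0.1, 0.0],
    "Staff Engineering Lead":           [0.1, 0.9, 0.1],
}
# Hypothetical workforce embeddings: driven by shared competencies,
# so the same two titles land close together.
workforce = {
    "Senior Technical Program Manager": [0.7, 0.6, 0.4],
    "Staff Engineering Lead":           [0.6, 0.7, 0.4],
}
t1, t2 = generic
print(round(cosine(generic[t1], generic[t2]), 2))    # low similarity
print(round(cosine(workforce[t1], workforce[t2]), 2))  # high similarity
```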

Performance comparison

The gap between generic and workforce-specific embeddings is measurable across every retrieval task in the HR domain.

| Retrieval task | Generic web embeddings (accuracy) | Workforce-specific embeddings (accuracy) | Improvement |
| --- | --- | --- | --- |
| Skill-to-role matching | 61% | 89% | +28 points |
| Role similarity detection | 54% | 84% | +30 points |
| Career path adjacency | 47% | 81% | +34 points |
| Cross-functional skill transfer | 42% | 78% | +36 points |
| Project-to-skill inference | 58% | 85% | +27 points |

The accuracy gap is largest on the most organizationally complex tasks. Cross-functional skill transfer, the ability to recognize that a supply chain optimization skill applies to workforce planning, requires understanding that generic models simply do not possess. Workforce-specific embeddings close this gap because they are trained on the semantic structure of work itself: job architectures, skill taxonomies, career frameworks, and millions of real role transitions.

Training workforce-specific embedding models

Gloat’s proprietary embedding models are trained on a corpus of workforce-specific data that includes job descriptions, skill taxonomies, career transition records, performance narratives, and organizational structures across hundreds of enterprise customers. The training process uses contrastive learning, where the model learns to place semantically similar workforce concepts close together in vector space and dissimilar concepts far apart.

The training signal comes from observed workforce behavior, not editorial judgment. When thousands of employees successfully transition from Role A to Role B, the model learns that those roles are adjacent in capability space, even if their titles share no words. When a skill appears consistently in high-performing employees across a specific function, the model learns the association between that skill and that function’s competency requirements.
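One common way to implement this kind of contrastive signal is a triplet loss, sketched below with invented two-dimensional role vectors: roles linked by many observed transitions act as anchor-positive pairs, and randomly sampled unrelated roles act as negatives. This is a minimal illustration of the objective, not the actual training loss.

```python
def dist2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero when the positive is closer than the negative by at least
    the margin; otherwise the model still has distance to correct."""
    return max(0.0, dist2(anchor, positive) - dist2(anchor, negative) + margin)

role_a = [0.2, 0.8]  # e.g. the source role of many observed transitions
role_b = [0.3, 0.7]  # frequent transition target -> positive example
role_c = [0.9, 0.1]  # unrelated role -> negative example

loss = triplet_loss(role_a, role_b, role_c)
print(loss)  # small positive loss: the pair is already close
```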

This approach produces embeddings that encode organizational reality rather than linguistic similarity. Two job titles can share every word and sit far apart in the embedding space if they represent fundamentally different work. Two job titles can share no words and sit close together if the underlying competencies overlap.

HyDE retrieval: bridging the query-document gap

Even with workforce-specific embeddings, a fundamental challenge remains. Users ask questions in natural language. Documents are written in organizational language. The query “who on my team could backfill the open product role” does not look anything like the structured data that describes team composition, skill profiles, and role requirements. The semantic distance between the query and the relevant documents is large, and standard embedding similarity search struggles to bridge it.

Hypothetical Document Embeddings (HyDE) solves this problem by inserting a generation step before retrieval. Instead of embedding the raw query and searching for similar documents, HyDE first generates a hypothetical ideal answer to the query. That hypothetical answer is written in the same register and structure as the actual documents in the corpus. The system then embeds the hypothetical answer and uses that vector to search.

The process follows three steps. First, the system takes the user query and generates a plausible answer using a language model, without access to the actual data. Second, it embeds that hypothetical answer using the workforce-specific embedding model. Third, it uses the resulting vector to perform similarity search against the real document corpus. Because the hypothetical answer resembles real documents in structure and vocabulary, the embedding similarity search returns more relevant results than a raw query embedding would.
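The three steps can be sketched end to end with stub models. The bag-of-words `embed` and the canned `generate_hypothetical` below stand in for the workforce embedding model and the language model; the corpus, query, and vocabulary are invented.

```python
import math

def embed(text):
    """Stub embedder: bag-of-words counts over a tiny fixed vocabulary."""
    vocab = ["skill", "role", "team", "capacity", "product", "backfill"]
    words = text.lower().split()
    return [words.count(w) for w in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def generate_hypothetical(query):
    """Stub for step 1: a language model drafts a document-shaped answer,
    in the corpus's register, without access to the real data."""
    return ("candidate holds the product skill set for the open role "
            "and sits on a team with spare capacity")

corpus = [
    "role profile: product role requires product skill and roadmapping skill",
    "quarterly report: revenue grew in the enterprise segment",
]
query = "who on my team could backfill the open product role"

hyde_vec = embed(generate_hypothetical(query))                 # steps 1-2
best = max(corpus, key=lambda d: cosine(hyde_vec, embed(d)))   # step 3
print(best)
```

In this toy setup the hypothetical-answer vector scores higher against the relevant document than the raw query vector does, which is exactly the gap HyDE is meant to close.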

In workforce contexts, HyDE is particularly effective because the gap between how people ask questions and how workforce data is structured is unusually large. An HR business partner asking about redeployment options thinks in terms of people and situations. The underlying data is structured around skills, roles, and organizational nodes. HyDE bridges that gap by translating the human question into the shape of the data before searching.

How graph and embeddings work together

The knowledge graph and embedding-based retrieval are not competing approaches. They serve complementary functions in the intelligence layer. The graph provides structural traversal: following known relationships between entities to assemble context. Embeddings provide semantic search: finding relevant information based on meaning rather than explicit connections.

In practice, an agent query typically uses both. When a talent marketplace agent receives a request to find internal candidates for an open role, it first uses the graph to traverse from the role node to its required skill nodes, then to person nodes who hold those skills, filtering by team capacity and availability. It then uses embedding-based retrieval to find additional candidates whose experience descriptions are semantically similar to the role requirements, even if the explicit skill tags do not match perfectly.

This hybrid approach catches what either system alone would miss. The graph catches candidates with explicitly tagged matching skills. The embeddings catch candidates whose experience narratives reveal relevant capabilities that were never formally tagged. Together, they produce a candidate set that is both structurally sound and semantically complete.
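A minimal sketch of that hybrid assembly, with invented people and a token-overlap stub standing in for real embedding similarity:

```python
def graph_candidates(role_skills, person_skills):
    """Structural pass: anyone whose tagged skills overlap the role's."""
    return {p for p, skills in person_skills.items() if skills & role_skills}

def embedding_candidates(role_text, person_bios, threshold=0.5):
    """Semantic pass (stubbed): token overlap with the role description
    as a stand-in for similarity in an embedding space."""
    role_words = set(role_text.split())
    return {p for p, bio in person_bios.items()
            if len(role_words & set(bio.split())) / len(role_words) >= threshold}

# Invented data: ana has a tagged skill match; ben's untagged experience
# narrative is semantically close to the role; cam matches neither way.
person_skills = {"ana": {"roadmapping"}, "ben": set(), "cam": {"sql"}}
person_bios = {
    "ana": "owns roadmapping for platform",
    "ben": "led product discovery and roadmap planning for launches",
    "cam": "maintains data pipelines",
}
role_skills = {"roadmapping", "discovery"}
role_text = "product discovery roadmap planning"

hybrid = (graph_candidates(role_skills, person_skills)
          | embedding_candidates(role_text, person_bios))
print(sorted(hybrid))  # -> ['ana', 'ben']
```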

Entity resolution: building a clean graph

A knowledge graph is only as reliable as its nodes. Enterprise workforce data arrives from dozens of systems: HRIS platforms, applicant tracking systems, learning management systems, project management tools, and performance review platforms. The same person might appear as “Jennifer Smith” in one system, “J. Smith” in another, and “[email protected]” in a third. The same skill might be labeled “Python” in one system, “Python programming” in another, and “Python 3.x development” in a third.

Entity resolution is the process of determining that multiple records refer to the same real-world entity and merging them into a single graph node. Gloat’s resolution pipeline uses a combination of deterministic matching (email addresses, employee IDs) and probabilistic matching (name similarity, organizational context, temporal overlap) to deduplicate entities before they enter the graph.

For skills specifically, resolution maps against a normalized taxonomy that reduces hundreds of surface-form variations to canonical skill nodes. This normalization ensures that graph traversals and embedding searches operate on clean, deduplicated data rather than fragmentary records scattered across source systems.
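The two-tier matching described above can be sketched with Python's standard library, using exact email match as the deterministic rule and `difflib` name similarity as the probabilistic fallback. The records, threshold, and greedy clustering are illustrative, not the production pipeline.

```python
from difflib import SequenceMatcher

# Invented source records for one real-world person.
records = [
    {"source": "hris", "name": "Jennifer Smith", "email": "[email protected]"},
    {"source": "ats",  "name": "J. Smith",       "email": "[email protected]"},
    {"source": "lms",  "name": "Jennifer Smyth", "email": None},
]

def same_entity(a, b, name_threshold=0.8):
    # Deterministic rule: an exact email match settles it.
    if a["email"] and a["email"] == b["email"]:
        return True
    # Probabilistic fallback: fuzzy name similarity.
    ratio = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    return ratio >= name_threshold

def resolve(records):
    """Greedy single-pass merge: each record joins the first cluster
    containing a record it matches, else starts a new cluster."""
    clusters = []
    for rec in records:
        for cluster in clusters:
            if any(same_entity(rec, member) for member in cluster):
                cluster.append(rec)
                break
        else:
            clusters.append([rec])
    return clusters

clusters = resolve(records)
print(len(clusters))  # -> 1: all three records merge into one node
```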

Query patterns agents use in production

Understanding the graph and retrieval layer becomes concrete when you see the query patterns that agents execute. Each pattern combines graph traversal with embedding search in a specific sequence tuned to the use case.

A skills gap analysis starts by traversing from a target role node to its required skill nodes, then from a person node to their current skill nodes, and computing the difference. An internal mobility match starts by traversing from a person’s skill nodes to adjacent skill nodes (using the RELATED_TO relationship), then from those adjacent skills to role nodes that require them. A succession planning query starts by traversing from a leadership role to the team it manages, then to the people on that team, then to those people’s skills and career trajectory data, ranking readiness using a combination of graph distance and embedding similarity to the target role’s profile.
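The first pattern, skills gap analysis, reduces to a set difference once both traversals complete. A minimal sketch with invented skills:

```python
# Invented role requirements and held skills for illustration.
ROLE_REQUIRES = {"staff_engineer": {"distributed_systems", "mentoring", "go"}}
PERSON_SKILLS = {"ana": {"go", "mentoring"}}

def skills_gap(person, role):
    """Traverse role -> required skills and person -> held skills,
    then return what the person is missing."""
    return ROLE_REQUIRES[role] - PERSON_SKILLS[person]

print(skills_gap("ana", "staff_engineer"))  # -> {'distributed_systems'}
```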

Each of these patterns executes in under 200 milliseconds on the production graph. The speed matters because agents must assemble context from multiple queries before generating a response, and the total latency budget for a user-facing interaction is measured in single-digit seconds.

Key insight

The difference between a skills database and a knowledge graph is the difference between a phone book and a city. One stores entries. The other encodes relationships, distances, and the paths between them.

Key terms

Knowledge graph
A data structure that represents entities as nodes and their relationships as edges, enabling multi-hop traversal to discover indirect connections such as the path from a person to a skill to a project to a team.
Neo4j
A native graph database optimized for storing and querying highly connected data, used as the structural backbone of the workforce intelligence layer.
Embedding
A numerical vector representation of text that captures semantic meaning, enabling similarity comparisons between concepts like job titles, skills, and role descriptions.
Workforce-specific embedding
An embedding model trained on organizational and labor-market data rather than generic web text, producing vectors that accurately capture the semantic relationships between HR concepts.
HyDE (Hypothetical Document Embeddings)
A retrieval technique that generates a hypothetical ideal answer to a query, embeds that answer, and uses the resulting vector to search for real documents, improving recall for ambiguous or natural-language queries.
Multi-hop traversal
A graph query pattern that follows two or more relationship edges to reach a result, such as traversing from a person node through a skill node to a project node to find relevant experience.
Entity resolution
The process of determining that two or more records refer to the same real-world entity, critical for merging data from HRIS, ATS, LMS, and other workforce systems into a single graph node.
The bottom line

A workforce intelligence layer is only as good as its retrieval system. Neo4j provides the structural backbone, encoding 2.4 million entities and 18.7 million relationships that agents can traverse in real time. Proprietary embedding models trained on workforce semantics close the accuracy gap that generic models leave open, and HyDE retrieval ensures that even ambiguous natural-language queries return precise, contextual results. Together, these components turn static HR data into a live, queryable map of organizational capability.