Learning from Accountants: Human-in-the-Loop Systems Done Right (May 13, 2026)
TL;DR
The best approach to automated invoice coding is not full replacement of AP clerks, but human-in-the-loop systems where AI handles routine cases and human experts focus on complex ones. Effective systems should provide confidence scores and explanations for their predictions so reviewers can quickly decide whether to accept or correct them. The interaction between human and machine creates a powerful feedback loop—both implicit (e.g., corrections and review behavior) and explicit (e.g., structured questions)—that enables continuous improvement of the AI over time. Striking the right balance in learning speed is key: too fast risks overfitting to outliers, too slow frustrates users. Over time, the share of invoices needing significant human intervention should shrink, dramatically reducing manual labor while preserving accuracy and control over financial records.
Full Post
Overview
The traditional approach to invoice coding relied almost exclusively on human AP clerks. With money on the line, their work was crucial, but often monotonous: imagine dealing with a "big pile" of invoices every day, many following similar patterns in terms of what needs to be extracted and coded, yet each requiring full attention to avoid costly mistakes. Now that AI has started to radically transform this process, what will the future of invoice coding look like? I believe the best approach to automated invoice coding is not to replace AP clerks, but to let AI take care of the "clear" cases so that human experts can focus their energy, insights, and judgement on the more complex ones. In short, the present and foreseeable future of invoice coding are human-in-the-loop systems. Below, I briefly explain what they are and how they can be applied to invoice coding.
What are Human-in-the-Loop Systems?
In a human-in-the-loop system, an automated agent—now often powered by machine learning or AI—handles high-volume, routine cases quickly and without interruption. Whenever a case exceeds a risk or uncertainty threshold, or when policy demands it, the system escalates to a human reviewer who can evaluate, correct, or approve the outcome before it is finalized. The motivation is to combine the speed and consistency of automation with the judgement and accountability of human expertise. With machine learning and AI, one can go a step further: the system learns from each human decision, steadily improving its accuracy and expanding the share of cases it can handle autonomously over time.
Applying the Human-in-the-Loop Paradigm to Invoice Coding
In the context of invoice coding, the ideal human-in-the-loop system predicts all fields automatically and lets human experts approve these predictions. As I discussed in earlier posts (Why Invoice Coding is an AI Grand Challenge, Inside the Noise, and Context Matters), automated invoice coding is challenging and requires not only advanced document understanding, but also familiarity with customer-specific accounting practices. Hence some invoices will require more human involvement, from correcting individual field values to coding major aspects from scratch.
For an effective interaction between machine and human expert, the AI should provide a clear indication of the uncertainty or confidence level associated with each prediction. For example, "I believe the best answer is vendor V and I am 90% confident in this prediction." This way the human reviewer can quickly identify the best course of action, from investigating and correcting individual fields to starting over. (The latter may be necessary when crucial fields such as vendor code or property code are mispredicted.) Production systems are beginning to implement this: PredictAP, for example, recently introduced automatic review routing based on per-field confidence across vendor, property, and GL account—routing invoices to human reviewers precisely when the model is least certain.
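To make the routing idea concrete, here is a minimal sketch in Python. Everything in it is illustrative: the field names, the single 0.85 threshold, and the two-way auto/human split are my assumptions, not PredictAP's actual implementation.

```python
# Illustrative sketch of per-field confidence routing; the threshold and
# field names are assumptions, not a real system's configuration.

REVIEW_THRESHOLD = 0.85  # assumed cutoff; would be tuned per deployment

def route_invoice(predictions):
    """predictions: dict mapping field name -> (value, confidence in [0, 1]).
    Returns ('auto', []) if every field is confident enough, otherwise
    ('human', fields_that_triggered_review)."""
    uncertain = [f for f, (_, conf) in predictions.items()
                 if conf < REVIEW_THRESHOLD]
    return ("auto", []) if not uncertain else ("human", uncertain)

preds = {
    "vendor": ("V-1042", 0.97),
    "property": ("P-07", 0.91),
    "gl_account": ("6210", 0.62),  # low confidence -> escalate to a human
}
decision, fields = route_invoice(preds)  # ('human', ['gl_account'])
```

Returning the specific low-confidence fields, rather than a bare yes/no, lets the reviewer jump straight to the predictions that need scrutiny.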
Note that designing good confidence measures represents a research challenge in itself. Established success measures such as model calibration, and related remedies from isotonic regression to Platt scaling, consider the big picture, i.e., whether a model generally provides accurate confidence scores for its predictions. Unfortunately, a high calibration score does not guarantee that each individual prediction's confidence value is accurate. And what does 90% confidence even mean? Intuitively, it says that in 9 out of 10 "comparable" cases, the prediction would be correct. The problem is that "comparable" is not well-defined, because invoices may differ from each other in many ways.
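To illustrate why calibration is a big-picture property, here is a small sketch of expected calibration error (ECE), a standard aggregate measure: predictions are grouped into confidence bins, and each bin's average confidence is compared to its empirical accuracy. A model can score well on this aggregate and still be off on any individual prediction.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Group predictions into confidence bins; ECE is the weighted average
    gap between each bin's mean confidence and its empirical accuracy.
    A low ECE says the model is calibrated on average, not per prediction."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            accuracy = sum(ok for _, ok in b) / len(b)
            ece += (len(b) / n) * abs(avg_conf - accuracy)
    return ece

# Ten predictions at 90% confidence, nine of them correct: well calibrated
# in aggregate, even though we cannot say which single prediction is the miss.
ece = expected_calibration_error([0.9] * 10, [True] * 9 + [False])  # ~0.0
```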
To further assist the human reviewer, the AI should provide explanations for its predictions by highlighting relevant text in the invoice or by presenting its chain of reasoning. This way the reviewer can better understand the rationale behind the AI's suggestions and make informed decisions about whether to accept or correct them.
In my work with PredictAP, I have already seen several cases where the AI got things right and the initially skeptical human reviewer ultimately agreed with it. Like meaningful confidence scores, producing good explanations remains a research challenge, especially for complex models such as large language models. Since the gigantic neural networks underlying modern AI encode complex patterns in high-dimensional spaces, it is often difficult to extract simple and intuitive explanations for their predictions. Not surprisingly, explainable AI has attracted a lot of attention in the research community.
Another way for the AI to explain its predictions is by showing related cases from the past along with their outcomes. A typical example would be to present last month's water bill to justify the coding of this month's water bill. Unfortunately, determining what counts as a "related case" is generally not that straightforward. While a past invoice from the same vendor may provide a useful reference for the predicted vendor code, differences in the items purchased or in the property where a service was provided may make it less relevant for the predicted GL account or property code. Hence a simple notion of "the most relevant invoices are those most similar to the given one" would often fall short.
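One way to capture this field-dependence is to make the similarity measure depend on which prediction we are trying to justify. The sketch below is purely illustrative: the attribute names and weights are made up, and a real system would learn similarity from data rather than hand-tune it.

```python
# Illustrative only: attribute names and weights are invented. The point is
# that "related" depends on WHICH predicted field we want to justify.

FIELD_WEIGHTS = {
    # justifying the vendor code: vendor identity dominates
    "vendor_code": {"vendor": 1.0, "items": 0.2, "property": 0.1},
    # justifying the GL account: the items purchased matter most
    "gl_account": {"vendor": 0.3, "items": 1.0, "property": 0.2},
}

def similarity(a, b, weights):
    """Weighted fraction of attributes on which two invoices agree."""
    matched = sum(w for attr, w in weights.items() if a.get(attr) == b.get(attr))
    return matched / sum(weights.values())

def most_related(invoice, history, target_field):
    """Past invoice most relevant to justifying `target_field`."""
    weights = FIELD_WEIGHTS[target_field]
    return max(history, key=lambda past: similarity(invoice, past, weights))

this_month = {"vendor": "CityWater", "items": "water service", "property": "P-07"}
history = [
    {"id": "last-month-water", "vendor": "CityWater",
     "items": "water service", "property": "P-07"},
    {"id": "pipe-repair", "vendor": "CityWater",
     "items": "repairs", "property": "P-03"},
]
best = most_related(this_month, history, "gl_account")  # last month's water bill
```

Note that under the "vendor_code" weights, both past invoices would look almost equally relevant, which is exactly the field-dependence the text describes.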
Counterexamples deserve special attention here. Sometimes the most convincing justification is not a confirming case but a contrasting one: an invoice that looks very similar to the given one yet was coded differently. Surfacing such a counterexample forces the reviewer to confront exactly what distinguishes the two invoices, and why that distinction matters for the coding decision. This is a qualitatively different kind of explanation than similarity-based retrieval, and arguably a more powerful one: it makes the decision boundary visible rather than just illustrating the predicted side of it. Building a system that can reliably identify and present useful counterexamples is an open and under-explored research problem.
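As a sketch, counterexample retrieval needs only a small twist on similarity search: restrict the candidates to past invoices whose coding disagrees with the prediction, then take the most similar one. The attribute-overlap similarity below is a deliberately crude stand-in for a real model, and all names are invented.

```python
# A crude stand-in for a real similarity model: count shared attributes.
def overlap(a, b, attrs=("vendor", "items", "property")):
    return sum(a.get(k) == b.get(k) for k in attrs)

def best_counterexample(invoice, predicted_code, history):
    """Most similar past invoice whose coding DISAGREES with the prediction,
    or None if every past case agrees. Surfacing it exposes the decision
    boundary instead of just illustrating the predicted side of it."""
    contrasting = [p for p in history if p["gl_account"] != predicted_code]
    return max(contrasting, key=lambda p: overlap(invoice, p), default=None)

invoice = {"vendor": "AcmeHVAC", "items": "filter replacement", "property": "P-02"}
history = [
    {"vendor": "AcmeHVAC", "items": "filter replacement",
     "property": "P-02", "gl_account": "6400"},  # confirming case
    {"vendor": "AcmeHVAC", "items": "filter replacement",
     "property": "P-09", "gl_account": "6550"},  # near-identical, coded differently
]
counter = best_counterexample(invoice, "6400", history)  # the "6550" invoice
```

The hard research problem the text points to is hidden in `overlap`: a counterexample is only useful if the similarity measure captures the distinctions that actually drive the coding decision.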
How do AI systems learn from accountants?
As I discussed in a recent post, automation of invoice coding requires domain-specific AI that not only possesses a general understanding of language and "the world," but is also familiar with a customer's specific coding rules and preferences. Arguably, the database of past coded invoices represents the most important materialization of the accountants' combined knowledge and experience. Human-in-the-loop invoice coding provides powerful additional opportunities to learn from human experts based on their interactions with the system. These interactions generate two kinds of feedback: implicit, derived from the accountant's actions, and explicit, solicited through structured questions.
Implicit feedback is derived essentially for free from the accountant's actions: when the accountant changes a predicted value, they tell the system not only that the initial answer was undesirable, but also what they consider a better answer. In addition to this strongest and most direct form of feedback, one can also learn from weaker signals. For example, if the accountant spends a lot of time investigating a particular prediction, it signals a level of doubt or uncertainty about it. This could enable the system to learn which cases or fields are more challenging and to adjust its confidence scores accordingly. Similarly, the order in which the accountant reviews and corrects individual fields may provide insights into which fields are more critical or depend on each other. GL accounts, for instance, may be vendor- or property-specific. PredictAP retrains its models daily on customer corrections, giving the system a near-continuous signal from accountant behavior without any additional effort on the reviewer's part.
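As a toy illustration of how corrections become training signal, the sketch below reduces learning to a per-vendor preference table for GL accounts. The update weights, and the idea of using a counts table at all, are my own simplification; a production system would retrain full models on the same correction signals.

```python
from collections import defaultdict

# Toy illustration: learn vendor -> GL-account preferences from reviews.
# The +1/-1/+2 update weights are arbitrary choices for the example.
gl_counts = defaultdict(lambda: defaultdict(int))

def record_review(vendor, predicted_gl, final_gl):
    """Implicit feedback: an untouched prediction confirms the answer;
    an edited one downweights the old answer and upweights the correction."""
    if final_gl == predicted_gl:
        gl_counts[vendor][predicted_gl] += 1   # silent approval
    else:
        gl_counts[vendor][predicted_gl] -= 1   # rejected prediction
        gl_counts[vendor][final_gl] += 2       # stated better answer

def predict_gl(vendor):
    options = gl_counts.get(vendor)
    return max(options, key=options.get) if options else None

record_review("CityWater", "6310", "6310")  # accountant approved as-is
record_review("CityWater", "6210", "6310")  # accountant corrected 6210 -> 6310
```

Note that a correction carries more information than an approval, and the asymmetric weights reflect that: the accountant names both a wrong answer and a better one.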
In contrast, explicit feedback requires a little extra effort beyond the accountant's regular work. This could consist of simple yes-no questions about the AI's predictions or explanations, like "was this explanation helpful?" For more detailed feedback, the system could present multiple alternative explanations and ask which of them was most helpful. A more advanced approach could ask the human to explain why a prediction was wrong and what information would most have helped them identify and correct the error. Feedback could be solicited in different formats: a free-form response is often easy to write, but difficult for the system to learn from. Multiple-choice questions provide more structure, thus facilitating automated learning. However, designing good options that cover all scenarios and make it easy for the accountant to select the most relevant one can be challenging.
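A hypothetical schema for capturing such explicit feedback might look like the following; every field name here is invented for illustration, and a real system would tie the record to its prediction and explanation logs.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical feedback record combining the formats discussed above.
@dataclass
class ExplanationFeedback:
    invoice_id: str
    prediction_field: str                 # e.g. "gl_account"
    helpful: Optional[bool] = None        # simple yes/no signal
    chosen_option: Optional[str] = None   # multiple-choice answer, if offered
    free_text: str = ""                   # richest signal, hardest to learn from

fb = ExplanationFeedback("INV-001", "gl_account", helpful=False,
                         chosen_option="explanation ignored the vendor")
```

Keeping the structured fields optional mirrors the trade-off in the text: the cheap yes/no signal is always easy to collect, while the richer fields are filled in only when the accountant is willing to spend the extra effort.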
Conclusion
In my opinion, AI-powered human-in-the-loop systems represent the best approach to invoice coding for the present and the foreseeable future. They automate the "boring" aspects of the process, while letting human experts focus on cases where their deep knowledge, judgement, and insights are most needed. The interaction with the accountant provides a rich source of feedback that enables continuous system improvement. Much of it is generated "for free" as part of the regular work, but investing a little extra effort into explicit feedback can pay further dividends. Designing the learning process is a key issue: If the system learns too quickly, it may overfit to outliers, like when the human gets it wrong. If it learns too slowly, it may be frustrating for the user who has to correct the same type of error multiple times before the system learns from it. Over time, the proportion of invoices requiring significant human intervention should shrink as the system learns from the corrections it receives. This approach strikes a practical balance: it dramatically reduces the manual labor involved in coding large volumes of invoices while preserving accuracy and control over the organization's financial records, which is especially important for audit compliance and reporting integrity.