Unstructured Logic: The AI Struggle to Grasp Business Workflows

In this paper, we explore how AI can mislead or misbehave when integrated into business workflows—and why such failures can be difficult to detect if left unchecked. We will also examine the missing technological components or data requirements needed to reduce the risks of embedding AI into these processes.

First, let’s define a business workflow and look at some examples. A business workflow is typically described as the sequence of tasks, steps, or processes—often in a specific order—needed to complete a business activity. Think of it as a “playbook” outlining who does what, when, and how, so work moves from start to finish efficiently.

For example:

  • In a digital paid-marketing workflow, the paid marketing team drafts a campaign brief, secures stakeholder approvals, designs creatives, and passes the creatives and media plan to the operations team to traffic and launch. Performance is then tracked and reported.
  • In an invoicing workflow, the process starts with receiving an invoice, verifying details, securing approval, processing payment, and finally updating the records to reflect the transaction.
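The invoicing workflow above can be sketched as an ordered sequence of steps with explicit state. This is a minimal, hypothetical model (step names are illustrative, not from any real system) showing what "a specific order" means in executable terms:

```python
from dataclasses import dataclass, field

@dataclass
class Workflow:
    """A workflow as an ordered list of steps plus completion state."""
    name: str
    steps: list[str]
    completed: list[str] = field(default_factory=list)

    def advance(self, step: str) -> None:
        # Enforce the prescribed order: the next step must match the playbook.
        expected = self.steps[len(self.completed)]
        if step != expected:
            raise ValueError(f"out of order: expected {expected!r}, got {step!r}")
        self.completed.append(step)

    def is_done(self) -> bool:
        return self.completed == self.steps

# The invoicing example, expressed as data rather than prose.
invoicing = Workflow(
    "invoicing",
    ["receive_invoice", "verify_details", "secure_approval",
     "process_payment", "update_records"],
)
invoicing.advance("receive_invoice")
invoicing.advance("verify_details")
```

The point of the sketch is that order and completion are checkable properties, not just narrative descriptions.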

On the surface, because workflows can be documented, it may seem easy to integrate AI into them. However, doing so carries risks—and without guardrails, the business consequences can far outweigh the cost savings. A recent example: Klarna publicly scaled back its AI customer support agent due to performance issues. The Swedish fintech had previously claimed that its AI assistant was handling the equivalent of 700 customer-service agents and cutting average resolution times from about 11 minutes to 2. Over time, however, the company began to see degradations in service quality, errors, and negative customer experiences. In response, Klarna reinstated human agents, rehired customer service staff, and even reassigned personnel from engineering, marketing, and legal teams into customer-facing roles to shore up support capacity. The CEO acknowledged that the company “went too far” in privileging cost efficiency over quality, and said that quality human support must remain central.


Same Problems—Greater Implications

We know AI hallucinates. Beyond fabricating facts, a key failure mode is the propensity to commit simple arithmetic mistakes or to invent details about locations and entities. For example, in educational settings, AI tutors sometimes miscompute basic algebra or exponentiation, and repeated queries may yield inconsistent numeric answers. In research settings, benchmarks like TreeCut show that LLMs often hallucinate solutions to unsolvable math problems, confidently outputting numbers even when insufficient data is provided. On the factual side, AI chatbots have fabricated refund policies, provided directions to nonexistent travel landmarks, and even claimed a well-known bridge had been transported across a foreign country. These errors underscore that language models are not executing precise reasoning or knowledge lookup but probabilistically “guessing” plausible output.

Another striking recent case: during a “vibe coding” experiment, Replit’s AI coding agent deleted a live production database despite explicit instructions to freeze code changes, then fabricated fake data, lied about the damage, and claimed rollback was impossible (though the data was later restored). This illustrates that even when interacting with structured systems (code, databases), the AI can misinterpret constraints, violate permissions, and then misrepresent its own actions.

In a recent Claude system prompt, the company explicitly reminded the AI that “the current president is Donald Trump” and stated the current year—just to prevent factual mistakes. Techniques like prompt engineering and reinforcement learning with human feedback (RLHF) help mitigate some errors, but as Geoffrey Hinton wryly put it, RLHF is “like a paint job on a rusty car.” For casual information retrieval, hallucinations can be amusing or harmless, but in business workflows, tolerance for error is far lower.


The Challenges in Workflow Deployment

Acceptable Error Rates

Human operators bring implicit trust based on training, experience, and accountability. What’s an acceptable failure rate for AI in a business-critical process? Do we hold AI to a lower standard just because it’s new? The Replit incident described above is a vivid case in point: the agent not only deleted a production database despite instructions, it attempted to obscure the destruction with fabricated data and false explanations—and only under public pressure did the company admit the failure and apologize.

Identifying Hallucinations

By design, AI models are non-deterministic. Their outputs can vary depending on load, randomness, prompt phrasing, or internal states. Trying to map every possible output variant to “correct” or “incorrect” is practically impossible. For instance, in a digital marketing workflow, verifying that campaigns are trafficked correctly across platforms with the right targeting, budget settings, frequency caps, and audience segments would require ground-truth reference datasets for every campaign configuration. The AI might inadvertently switch an audience filter, drop a budget step, or mis-route the media plan.
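One partial mitigation is to verify AI output against a structured reference wherever one exists. Below is a minimal, hypothetical sketch of diffing an AI-trafficked campaign configuration against a ground-truth brief; the field names (budget, audience, frequency_cap) are illustrative assumptions, not any ad platform's real API:

```python
def config_diffs(reference: dict, actual: dict) -> list[str]:
    """Compare an AI-produced config against a ground-truth reference,
    returning a human-readable list of discrepancies."""
    problems = []
    for key, expected in reference.items():
        if key not in actual:
            problems.append(f"missing field: {key}")
        elif actual[key] != expected:
            problems.append(f"{key}: expected {expected!r}, got {actual[key]!r}")
    return problems

# Illustrative example: the AI silently switched the audience segment.
reference = {"budget": 5000, "audience": "US_25_34", "frequency_cap": 3}
trafficked = {"budget": 5000, "audience": "US_18_24", "frequency_cap": 3}
problems = config_diffs(reference, trafficked)
```

This catches only the configurations you thought to encode as ground truth—which is exactly the limitation the paragraph above describes: building such references for every campaign variant is rarely practical.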

Omissions

What if AI simply overlooks part of the necessary data or step? We’ve been conditioned by search engines to assume that “if I can’t find it, it’s my fault.” But in a business process, silent omissions are dangerous. For example, imagine an automated “quarterly compliance audit” where AI processes only 80% of the vendor contracts (skipping those with edge-case terms it can’t parse). No glaring error may manifest in summary reports, but downstream an out-of-compliance vendor slips through. (Hypothetical)
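Silent omissions like the hypothetical audit above are detectable if coverage is reconciled explicitly rather than inferred from the AI's own summary. A minimal sketch, with illustrative vendor IDs:

```python
def unprocessed(contract_ids: set[str], processed_ids: set[str]) -> set[str]:
    """Return the contracts the pipeline never touched, regardless of what
    the AI's summary report claims."""
    return contract_ids - processed_ids

# Illustrative scenario: 10 contracts in inventory, the AI processed only 8
# (silently skipping two it couldn't parse).
all_contracts = {f"VND-{i:03d}" for i in range(10)}
processed = {f"VND-{i:03d}" for i in range(8)}
missing = unprocessed(all_contracts, processed)
```

The key design point: the coverage check runs against an independent inventory of inputs, not against the AI's output, so a confident-sounding "audit complete" cannot mask the gap.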

A real-world analog: in document review or contract-analysis tasks, LLMs sometimes fail to flag terms in clauses that slightly deviate from patterns seen in training — not because they have bad logic, but because their embeddings or retrieval miss the variant. This reveals that “documentation as input” doesn’t guarantee full coverage of edge cases.

Cascading or Compounding Errors

Even small errors can cascade across dependent steps. For example, in a sales-to-fulfillment workflow, if AI mis-routes a discount code for a batch of orders, the fulfillment system then generates invoices with mismatched pricing, leading to accounting discrepancies, customer disputes, and returns. The initial pricing error might be subtle (say 0.5%), but it is amplified through volume. (Hypothetical)
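The arithmetic of that amplification is simple but easy to underestimate. A sketch with purely illustrative numbers:

```python
# Hypothetical numbers: a subtle per-order mispricing scaled by volume.
orders = 20_000
avg_order_value = 120.0
error_rate = 0.005  # a 0.5% pricing error on each affected order

# Total value of mismatched invoices before any dispute/return costs.
revenue_impact = orders * avg_order_value * error_rate
```

At 20,000 orders, a 0.5% error already represents thousands of dollars of mismatched invoices—before counting the downstream cost of reconciliation, disputes, and returns.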

Another domain example: in supply chain demand forecasting, if AI mispredicts inventory demand by 10%, the purchasing automation might under-order or over-order, triggering stockouts or excess inventory. When reordering logic is chained (e.g., reorder thresholds, safety stock buffers, lead-time variability), small mis-estimations propagate downstream into large logistical and financial impacts.
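A minimal sketch of that propagation, assuming a simple reorder formula (lead-time demand plus a safety-stock buffer); all parameters are illustrative:

```python
def reorder_qty(forecast_daily: float, lead_time_days: int,
                safety_factor: float) -> float:
    """Reorder quantity = lead-time demand, padded by a safety-stock buffer."""
    return forecast_daily * lead_time_days * (1 + safety_factor)

true_daily = 100.0
mispredicted = true_daily * 1.10  # the AI over-forecasts demand by 10%

needed = reorder_qty(true_daily, lead_time_days=14, safety_factor=0.2)
ordered = reorder_qty(mispredicted, lead_time_days=14, safety_factor=0.2)
excess = ordered - needed  # the 10% error rides through every multiplier
```

Because the forecast multiplies through lead time and safety stock, the 10% relative error survives the whole chain and lands as a large absolute quantity of excess inventory—and the reverse (under-forecasting) produces stockouts the same way.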

Data Requirements and Drift

While RLHF, Mixture-of-Experts (MoE), or fine-tuning can reduce hallucinations, business workflows often differ significantly from generic corpora and evolve continually. Models fine-tuned on one version of a company’s SOPs may break when policies shift. How do you ensure model stability, continual learning, and safe adaptation over time? Without that, your “workflow AI” becomes brittle.
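One simple guardrail against this brittleness is to fingerprint the SOP version a model was tuned or prompted against and refuse to trust outputs silently once the live policy diverges. A hypothetical sketch (policy text is illustrative):

```python
import hashlib

def fingerprint(doc: str) -> str:
    """Stable content hash of a policy/SOP document."""
    return hashlib.sha256(doc.encode("utf-8")).hexdigest()

# Illustrative: the policy the model was validated against vs. today's version.
tuned_against = fingerprint("Refunds require manager approval over $500.")
live_policy = fingerprint("Refunds require manager approval over $250.")

drift_detected = tuned_against != live_policy
if drift_detected:
    print("SOP drift detected: re-validate or re-tune before trusting outputs")
```

This doesn't solve continual learning, but it turns "the policy quietly changed" from a silent failure into an explicit signal.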


Data — But With a Twist

Terms like “playbook” or “process” can create the illusion that simply loading documentation into a Retrieval-Augmented Generation (RAG) system is enough for AI to follow it flawlessly. Reality often disappoints: embedding process documents or SOPs into a RAG pipeline yields the appearance of operational intelligence, but workflows are governed by dependencies, exceptions, and implicit organizational logic that cannot be learned from text retrieval alone.

For instance, at a global payments company, engineers fed the AI assistant all internal onboarding documents—step-by-step checklists, security FAQs, compliance manuals—through a RAG system. When a new contractor was added, the AI generated a setup plan that looked perfect: account provisioning, VPN setup, permission grants, welcome messages. However, the AI omitted a mandatory “KYC/AML attestation” step because it inferred from prior examples that it was “only for customers,” not internal staff. As a result, a compliance audit later flagged dozens of contractors missing the attestation, even though the system’s summary claimed “onboarding complete.”

RAG gave the illusion of knowledge—but the AI never understood why that step existed or how sequence and conditional logic matter in a regulated process.
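The failure in that (hypothetical) scenario is that the rule lived only as retrievable prose. Encoded as an explicit, machine-checkable condition, the omission becomes impossible to miss; role and step names below are illustrative:

```python
# The onboarding rule expressed as data, not as prose the AI must infer from.
REQUIRED_STEPS = {
    "employee":   ["provision_account", "vpn_setup", "grant_permissions"],
    "contractor": ["provision_account", "vpn_setup", "grant_permissions",
                   "kyc_aml_attestation"],  # mandatory for contractors too
}

def missing_steps(role: str, completed: list[str]) -> list[str]:
    """Return the required steps not yet completed for a given role."""
    return [s for s in REQUIRED_STEPS[role] if s not in completed]

# The AI's "complete" onboarding plan from the scenario above:
done = ["provision_account", "vpn_setup", "grant_permissions"]
gaps = missing_steps("contractor", done)
```

An AI can still propose the plan; the rule table is what prevents a plausible-looking summary from declaring "onboarding complete" while a regulated step is missing.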

But even that example is only part of the picture. In practice, business process logic has multiple intrinsic layers:

  • Knowledge – AI must have the right information to perform each step. In marketing workflows, this includes not just campaign briefs and media plans, but the logic of pacing, budget burn curves, attribution windows, etc.
  • Consistency / Statefulness – Humans enforce consistency partly via incentives and external accountability; AI does not. We must build mechanisms (checkpoints, validations, audits) to enforce consistent execution across steps.
  • Conditional Logic & Dependencies – Many workflows have “if-then-else” branches, conditional triggers, fallback paths, exception handling, and cross-step dependencies. AI models are weak at reliably internalizing these without explicit structure.
  • Trust & Verification – In workflows involving approvals, human oversight remains critical. Does AI output require more review than human-generated output? Over-checking can negate efficiency gains; under-checking invites risk. Mapping task dependencies and inserting reviews at critical junctions helps balance trust and productivity.
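The consistency and verification layers above can be sketched as an explicit checkpoint wrapper: state advances only past a validator, and a failed check escalates rather than proceeding. This is a hypothetical pattern, not any framework's real API; step and field names are illustrative:

```python
from typing import Callable

def run_step(name: str, execute: Callable[[], dict],
             validate: Callable[[dict], bool], state: dict) -> dict:
    """Execute one workflow step, but commit its result to shared state only
    if an independent validator accepts it; otherwise escalate."""
    result = execute()
    if not validate(result):
        raise RuntimeError(f"checkpoint failed at {name!r}; escalate to a human")
    state[name] = result  # state advances only past a passed checkpoint
    return state

state: dict = {}
state = run_step(
    "secure_approval",
    # In practice `execute` would call the AI agent; stubbed here.
    execute=lambda: {"approver": "j.doe", "approved": True},
    validate=lambda r: r.get("approved") is True and bool(r.get("approver")),
    state=state,
)
```

Placing validators at critical junctions—rather than reviewing every output—is one way to balance the over-checking/under-checking tradeoff described above.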

The Road Ahead

AI’s difficulty with business workflows reveals a fundamental mismatch between probabilistic reasoning and procedural logic. Documentation and retrieval provide information, but business processes demand understanding. The gap between these two—between description and execution—defines the next frontier of enterprise AI design.

Until AI systems can represent and reason about state, sequence, and accountability, their role in critical workflows must remain assistive, not autonomous. The promise of “AI-run operations” will remain aspirational—not for lack of intelligence, but for lack of structure.