The End of the Eng-Ops Split

2015: Loosely based on a true story.

You are a machine learning engineer at Facebook in Menlo Park. Your task: build the best butt classification model, which decides if there is an exposed butt in an image.

The content policy team in D.C. has written country-specific censorship rules based on cultural tolerance for gluteal cleft—or butt crack, for the uninitiated.

Germany: 0% cleft.
Zimbabwe: 30% cleft.
Cupertino: 0%.
Montana: 20%.

A PM on your team writes data labeling guidelines for a business process outsourcing firm (BPO), and each example in your dataset is triple-reviewed by the firm's outsourced team to ensure consistency. You skim the labels, which seem reasonable.

You decide to train a CNN: it'll be perfect for this edge detection task. Two months later, you've cracked it. Your model goes live, great success, 92% precision, 98% recall. You never once had to talk to the policy team in D.C.

2023: The butt model has been in production for 8 years.

Another email: Policy has heard about LLMs and thinks it's time to build a more "context-aware" model. They would like the model to understand whether there is sexually suggestive posing, sexual context, or artistic intent in the image.

You receive a 10 page policy doc. The PM cleans it up a bit and sends it to the BPO. The data is triple reviewed, you skim the labels, and they seem fine.

You make an LLM decision tree, one LLM call per policy section, and aggregate the results. Two months pass. You are stuck at 85% precision and recall, no matter how much prompt engineering and model tuning you do. You try going back to a CNN, training it on the labels. It scores 83%.

Your data science spidey-sense tingles. Something is wrong with the labels or the policy.

You email the policy team, sending them the dataset and results.

The East Coast Metamates say they looked at 20 of the labels. 60% were good, 20% were wrong, 20% were edge cases they hadn’t thought of.

The butt model lives to fight another day, or at least until these discrepancies get sorted out.

2025.

The butt model is still in production… What went wrong?

The eng-ops split does not work for tasks at the frontier of LLM capability.

Physical and organizational distance between engineering and other teams may have worked for simple machine learning, but creates insurmountable challenges for tasks at the frontier of LLM capability, where engineering and policy decision-making are highly coupled. It doesn't matter if the task is content policy, sales chatbots, legal chatbots, or AI automation in any other industry.

Operations, Product, and Policy must be involved in LLM integration: Any task complicated enough to require an LLM will need policy expert labels.^[1] To detect nudity, outsourced labels will do, and an engineer can eyeball results without much trouble. To enforce a 10 page adult policy, a policy expert must label the data, look at results, and collaborate with engineers to determine the best AI integration.

Is a discrepancy between the LLM result and the ground truth due to the LLM's mistake, the labeler's mistake, or an ambiguous policy line? Resolving these questions is the primary blocker for LLM integration, not tuning hyperparameters or trying novel RL algorithms.^[2]

Engineer? There's a version of this article for you: The End of the Train-Test Split

Core problems with complex classification tasks

Policy Experts write abstract rules. For older classification tasks, the rules had to be simple, because simple was all that models could do. Are there guns in the image? Is there a minor in the image? Complicated tasks are harder to pin down.

What is hate speech?
What is sexually suggestive?
What makes false advertising cross from overdone to illegal?

Take these examples. Sexually suggestive? Artistic expression? Sufficient censorship?

Source: Lorde's album covers from Vice.com, where you can find discussion of what is visible or not in each photo.

Most policy documents are not well-kept. A content policy is typically simplified, operationalized, and sent to outsourced teams to enforce. Edge cases and discrepancies found in India never make it back to policy teams in DC, so abstract rules are rarely pinned down.

In production, we see 15-20% false positive rates on BPOs. Half are attributable to human error, half to policy gray area.

To resolve edge cases, labeling tasks require an expert's time. BPO agents can eyeball how much butt is visible, but struggle with what is "sexually suggestive." BPOs are low-wage workers in countries like India or the Phillipines, and may have different definitions of "sexual context" than the policy writers intended. The costs of training them are often prohibitive at scale.

Using in-house agents is not sufficient for training data, either, as small alignment issues in the dataset cause large issues in production: If internal agents are 95% accurate (pretty good), the ceiling for the LLM's performance is 95%. If the LLM gets 95% of those labels right, its accuracy will be 90%.

Hard classification tasks have high rates of expert disagreement. Ask two people if there's a gun in an image, odds are they'll agree. Ask two policy experts whether a pose is sexually suggestive per their definition, you will start a one hour debate. If two experts only agree 95% of the time, then hand off to internal agents for labeling at 95% accuracy, then the LLM is 95% accurate, you are down to 86% LLM accuracy.

Language models see details in data that even experts miss. LLMs read every word of the product description. They scrutinize every image for a single pixel of gluteal cleft. They see gluteal cleft through sheer clothing, in low opacity, and in the bottom left corner of the back of a t-shirt. Even if experts have reviewed the data, there must be a re-review and feedback loop to check what the LLM has flagged.

Since labels are often wrong or ambiguous, you must review discrepancies between the LLM and the ground truth. Did the expert missed something, was the policy ambiguous, or was the outsourced labeler wrong? You have to check the results and the LLM's explanation to find out.

In production, we see anywhere from 15-30% error rates in data considered "golden sets" by operations teams.

Since these classification tasks are so complex, you can only "debug" the model by explicating the policy rules in more detail for the LLM's understanding. A policy might have dozens of rules and thousands of possible inputs, creating a fat-tail of model mistakes. Unlike traditional machine learning, where you fix mistakes by changing the design or hyperparameters of your model, you fix LLM mistakes by changing the prompt. You can directly fix a mistake (e.g. by telling the model "do not consider spreading legs fully clothed to be sexually suggestive"), so fixing mistakes is fundamentally a policy task, not an engineering one.^[4]

The Policy and Engineering Teams Need to be in Direct, Frequent Communication. The SF-DC split doesn't work anymore. Resolving edge cases and, in many cases, changing the policy to reflect patterns identified in the data requires collaboration.

Experts have historically not needed to look at the data—seen as a low-status task—but it is the only way to achieve high accuracy. This is an unsolved problem in many large organizations that blocks LLM integrations.

The "training" step for language models has to be policy alignment, not heating up GPUs. Since the data will always be flawed and the test set won't be blind, the machine learning engineer's priority should be spent working with policy teams to improve the data. That means surfacing edge cases and policy gray areas, clarifying policy definitions, and leveraging LLM outputs to find more discrepancies until data is high-quality and policy is clear.

In production, this is an ongoing process, as LLMs will always surface new interesting cases and policies will continue to change. Policies and enforcement are better for this feedback loop: it enables consistent, scaled enforcement platform-wide.

Today and Tomorrow

This is a paradigm shift that many machine learning teams, and enterprises as a whole, have not yet embraced. It requires a complete change in how we approach classification tasks.

If this is the road to automation, is it even worthwhile? The process described above, while arduous, is the shortest route to consistent policy enforcement to date. Before LLMs, running a successful quality assurance program would be prohibitively expensive. Retraining human agents takes far longer than retraining LLMs. Policy experts have historically never been owners in quality assurance processes, but can be now. T&S policy enforcement decisions have often been left to human discretion, causing inconsistency: aligning on a robust policy is higher leverage than ever before.

To save a little time, an in-house human agent might do a first review of the results, then a policy expert can review only the discrepancies. We find this tradeoff works well in production.

What are the implications for leveraging LLMs for tasks which do not have binary classifications? Can an LLM be a lawyer if this much work is required to align, evaluate, and test models? Will an LLM ever ~know what you mean~ and skip all these alignment steps?

One core problem with the LLM architecture is that the model doesn't know when it is wrong. Model improvements over the past few years mean the LLM is right more often, but when it is wrong, it doesn't have an outlet.

This is a perennial machine learning problem: a model does not know what is "out of distribution" for itself.

Until that problem is solved, there will have to be an engineer in the loop improving and testing the model, and a policy expert evaluating the results. You can do this for complicated tasks like writing a patent application, but you have to be rigorous, define a rubric, curate expert data, and regularly evaluate model outputs. Calculating accuracy of each "training run" will never be as easy as checking if model_output == ground_truth, and will require a human in the loop. These complex tasks are far more lucrative than binary classification, and smart people are working on them.

Not everybody will take this rigorous approach, and as models improve, they might not have to. Until then, the highest leverage way to spend your time in 2026 will be looking closely at your data, cleaning your data, and labeling your data.

^[1] A policy expert may literally be someone with “Policy” in their title, or it may be a lawyer, accountant, sales rep, or anyone else documenting a business process that needs to be automated.

^[2] Note that LLMs can be trained or tuned on specific datasets for classification tasks, especially reasoning models, but these tend to be much smaller, higher quality labels rather than scaled, medium quality labels. Even in these cases, an engineer must spend a large amount of time looking at “test set” data and labels.

^[3] After several runs of this prompt-based iteration following by QA on unseen, production data, you may have enough data for real LLM training on a train-test split.