A Quirky Food Detection Adventure: Building (and Breaking) a Prototype Model
From pizza and burgers to sushi and tacos, the allure of creating an all-in-one food detection system can tempt even the most seasoned machine learning engineers. The dream is simple: snap a picture of your meal, upload it to a handy web application, and receive a quick bounding box with a confidence score telling you exactly what’s on your plate. In practice, as I discovered while putting together a prototype for a “Food Detection System,” that dream can turn into a precarious balancing act of training data, model performance, and deployment constraints. In this article, I’ll walk you through the journey of building a food detection prototype, highlight some of the laughable mistakes and near-misses, and shed light on why the results can sometimes be… horribly off.
Inspiration and Early Ambition
The motivation behind this project came from wanting to combine the fun of object detection with the universal relevance of food. Object detection, popularized by frameworks like YOLO and SSD, typically performs well on everyday objects such as cars, people, and traffic signs. But specialized tasks like distinguishing “pizza” from “steak” from “tacos” require curated datasets and careful training. I set out to gather examples of various dishes, ranging from typical fast-food items like burgers and fries to more diverse options like sushi and curry. The plan was to use a pre-trained detection model as a base, then fine-tune it on a custom “food dataset” built from food images on Yelp.
Armed with a bunch of images, some labeling tools, and a lot of caffeine, I labeled everything from burger patties to the shape of a taco shell. And so began my foray into the world of building a model that could (theoretically) identify your meal in seconds.
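Before going further, here is roughly what that fine-tuning plan boils down to in code. This is a minimal sketch assuming a YOLO-family model via the ultralytics package, a hypothetical food.yaml dataset config, and standard YOLO-format labels; the project’s actual framework, file names, and hyperparameters may differ.

```python
# Minimal fine-tuning sketch (assumptions: ultralytics YOLO; a hypothetical
# food.yaml that lists train/val image folders and class names such as
# burger, pizza, sushi, taco, curry; labels in YOLO text format, i.e. one
# "class_id x_center y_center width height" line per box, normalized to 0-1).
from ultralytics import YOLO

model = YOLO("yolov8n.pt")  # start from a COCO-pretrained backbone
model.train(data="food.yaml", epochs=100, imgsz=640)  # fine-tune on the food images
```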
Behind the Scenes: Building and Deploying
To deploy the food detection model for others (and myself) to test, I decided to containerize everything using Docker. By bundling Python dependencies, my detection code, and the model weights into a single container, I could push the image to AWS and run it in a cloud environment. Sure, it took a while to figure out how to handle platform architecture differences, but once I discovered Docker Buildx for multi-platform builds, I was set.
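If you hit the same wall, the multi-platform build reduces to a single command. The sketch below assumes the common case of building on an arm64 laptop for an amd64 cloud runtime; the image name and target platform are illustrative.

```bash
# Build for the cloud's architecture regardless of the local machine's
# (platform and tag are examples, not necessarily the project's values).
docker buildx build --platform linux/amd64 -t food-detection-app .
```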
The deployment pipeline looked something like this (a rough sketch of the commands follows the list):
1. Local Build: The Docker image was first created on my machine with a simple docker build -t food-detection-app . command.
2. Tagging and Pushing: I tagged the image for an AWS ECR repository and pushed it to the cloud.
3. App Runner Setup: Once in ECR, the image was picked up by AWS App Runner, which orchestrated the container’s runtime environment.
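For reference, steps 1 and 2 condense to roughly the commands below. The account ID, region, and repository name are placeholders rather than the project’s real values.

```bash
# Authenticate Docker against ECR (account ID and region are placeholders).
aws ecr get-login-password --region us-east-1 | \
  docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com

# Tag the locally built image for the ECR repository and push it.
docker tag food-detection-app:latest 123456789012.dkr.ecr.us-east-1.amazonaws.com/food-detection-app:latest
docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/food-detection-app:latest
```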
It’s a fascinating feeling to see your detection system up and running on a public endpoint (pictures attached here show the web interface in action). A user-friendly web page was the final piece of the puzzle, complete with a file upload option for images, an area for bounding box visualization, and a console to display confidence scores for each detected class.
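Under the hood, the upload-to-detections round trip is conceptually simple. The sketch below is a minimal illustration assuming Flask and an ultralytics model, with a hypothetical weights path; the actual app’s framework, route names, and structure may differ.

```python
# Minimal upload-and-detect endpoint (assumptions: Flask, ultralytics, and a
# hypothetical weights path; the real app's structure may differ).
from flask import Flask, request, jsonify
from PIL import Image
from ultralytics import YOLO

app = Flask(__name__)
model = YOLO("weights/food_best.pt")  # hypothetical fine-tuned weights

@app.route("/detect", methods=["POST"])
def detect():
    # Read the uploaded file from a form field named "image".
    image = Image.open(request.files["image"].stream)
    result = model(image)[0]  # run inference on the single image
    detections = [
        {
            "label": result.names[int(box.cls)],     # class name, e.g. "burger"
            "confidence": float(box.conf),           # confidence score
            "box": [float(v) for v in box.xyxy[0]],  # [x1, y1, x2, y2] in pixels
        }
        for box in result.boxes
    ]
    return jsonify(detections)

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```

The web page then only has to POST the file and draw the returned boxes and scores on top of the image.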
Hilariously “Horrible” Performance
As soon as I tried out a real-world image from my own kitchen, the system labeled my burger as “soup” with 96.8% confidence (the attached screenshot captures this comedic moment perfectly). Yes, the model believed my brightly dyed burger bun was a bowl of soup!
What went wrong? A quick glance at the performance metrics provides some insight. For reference, mAP50 is mean average precision at an IoU threshold of 0.5, while mAP50–95 averages it across thresholds from 0.5 to 0.95 (a sketch of how such numbers are typically produced follows the list):
• mAP50: 0.630
• mAP50–95: 0.562
• Precision: 0.636
• Recall: 0.566
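Numbers like these usually come out of a validation pass over the held-out split. The snippet below is a hedged sketch assuming an ultralytics fine-tuned model and the hypothetical food.yaml config from earlier; the project’s actual evaluation script may differ.

```python
# Hedged sketch of producing mAP/precision/recall on the validation split
# (assumes ultralytics; weight and config paths are illustrative).
from ultralytics import YOLO

model = YOLO("weights/food_best.pt")   # hypothetical fine-tuned weights
metrics = model.val(data="food.yaml")  # evaluates on the validation split

print(metrics.box.map50)  # mAP at IoU 0.50
print(metrics.box.map)    # mAP averaged over IoU 0.50-0.95
print(metrics.box.mp)     # mean precision
print(metrics.box.mr)     # mean recall
```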
On paper, these aren’t the worst numbers in the world, but they underscore several real-world challenges:
1. Inadequate or Imbalanced Dataset: According to the project’s README, only a limited set of images was available for each food category. Some dishes, like burgers, had more examples than others, causing the model to overfit to certain shapes or hues.
2. Domain Shift: Most images used for training were well-lit, standard images of neatly plated food. Real-world photos can have unusual angles, lighting, or coloring (like a dyed bun), confusing the model.
3. Hardware and Compute Constraints: Training object detection models requires significant compute. The README also points out constraints of CPU vs. GPU availability, limiting how robustly the model could be fine-tuned.
4. Class Similarities: Certain foods inherently look alike, especially if the angle or partial occlusion hides critical features. A bun could mimic a soup bowl shape if the training set was incomplete or lacked variety in “soup” examples.
Learning from Mistakes
One crucial lesson is the importance of a balanced and representative dataset. During labeling, you might think 100 samples per class is “enough.” But the moment you introduce something unusual (a colorful bun, a skewed angle, or poor lighting), the model’s confidence wavers. The README for this project emphasizes how “one size fits all” seldom works for specialized tasks. Even though we rely on pre-trained backbones, the domain shift between the original training data (commonly standard object detection sets like COCO) and specialized “food only” data can be immense.
Additionally, computing resources matter. With only a local CPU, you might cut corners by training for fewer epochs or using smaller batch sizes, inadvertently crippling performance. The consequence is a less robust model that only partially generalizes.
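To make that concrete, here is what the corner-cutting can look like in a hedged sketch; the numbers are illustrative, not the project’s actual settings.

```python
# Illustrative corner-cutting on CPU-only hardware (assumes ultralytics;
# epochs and batch size are examples, not the project's actual settings).
from ultralytics import YOLO

model = YOLO("yolov8n.pt")
model.train(data="food.yaml", epochs=25, batch=8, imgsz=640, device="cpu")
```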
Possible Ways Forward
Despite the humor in seeing a burger mistaken for soup, this prototype offers some valuable hints for improvement:
- Data Augmentation: Introduce more varied augmentations (different backgrounds, lighting conditions, color shifts, and occlusion) to mimic real-world scenarios; a rough sketch of what this can look like appears after this list.
- Expand Dataset: Gather more samples for each class, especially underrepresented ones like curry or taco. This will help the model better learn nuanced features.
- Validation with Real-World Images: Continuously test with images that people might actually upload (e.g., phone snaps of half-eaten meals).
- Iterative Fine-Tuning: Retrain the model periodically with new examples of misclassifications (like the “soup burger” fiasco).
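Picking up the augmentation idea from the first bullet, here is a sketch of what heavier, detection-aware augmentation can look like. It assumes the albumentations library and YOLO-format boxes; the specific transforms and parameters are illustrative, not what this project used.

```python
import numpy as np
import albumentations as A

# Detection-aware augmentations that keep bounding boxes consistent
# (transforms and probabilities are illustrative).
transform = A.Compose(
    [
        A.RandomBrightnessContrast(p=0.5),  # lighting variation
        A.HueSaturationValue(p=0.5),        # color shifts (think dyed buns)
        A.Rotate(limit=25, p=0.5),          # unusual angles
        A.CoarseDropout(p=0.3),             # partial occlusion
    ],
    bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"]),
)

# Example usage on a dummy image with one YOLO-format box (values illustrative).
image = np.zeros((480, 640, 3), dtype=np.uint8)
augmented = transform(
    image=image,
    bboxes=[(0.5, 0.5, 0.4, 0.3)],  # (x_center, y_center, width, height), normalized
    class_labels=["burger"],
)
```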
In the end, building a food detection prototype is both exhilarating and humbling. The comedic mislabeling of a burger as soup underscores the challenge of bridging the gap between polished training data and the chaos of real-world images. My “Food Detection System” might not be perfect — in fact, it’s far from it. But it stands as a testament to the iterative nature of machine learning projects: test, fail, learn, improve, and repeat.
If you’re considering building a similar system, don’t be disheartened by strange or “horrible” results. Embrace them as part of the learning curve. The process of gathering data, training, and deploying a model — in Docker, on AWS, or elsewhere — comes with a unique set of challenges. Each hiccup is a step toward a more robust system. And who knows? Maybe one day, your model will confidently classify that odd-colored burger for what it really is, instead of deciding it’s a bowl of soup.