Smart photo organization through privacy-first, on-device AI

User privacy can be a big concern when you integrate AI into your app, especially when your app asks users to upload personal data, like photos, to a cloud server for processing. To show how you can safeguard user privacy while building with AI, we developed a progressive web app (PWA) that brings the AI to the photos instead of the other way around. By running the model directly in the browser, we ensure the app doesn't compromise our users' privacy.

Here’s a look under the hood to see how we approached this.

AI Photo Metadata App Running

Design considerations

When handling personal photos, security and trust are everything. Our entire design philosophy therefore revolved around protecting user data, which led to a few key technical decisions.

On-device first (vs. cloud processing)

Normally, when building an AI feature, we default to a powerful cloud-based model like Gemini Pro. This typically involves sending data, in our case photos, to a data center for processing. While a cloud model is capable of more complex reasoning, our main focus for this app is maximizing user privacy and control.

Since we are handling lots of personal photos, we assumed users would feel more comfortable if their photos never left their device. That is why we chose a local model running in the browser. Instead of sending photos to the cloud, all AI analysis happens directly on the user's machine, so no data leaves the device and the feature can even work offline.

To complement this, we built the app as an installable progressive web app (PWA). With this offline-first approach, users can launch the app without an internet connection, reinforcing both its convenience and its privacy.
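As a quick sketch of the offline-first piece (the file name sw.js and the caching details are illustrative, not our exact setup), installability comes from a standard web app manifest, and offline launches come from registering a service worker that caches the app shell:

// Illustrative: register a service worker so the app shell can load offline.
// "/sw.js" is a hypothetical path; any precaching strategy (e.g., Workbox) works here.
if ("serviceWorker" in navigator) {
  window.addEventListener("load", () => {
    navigator.serviceWorker
      .register("/sw.js")
      .then(() => console.log("Service worker registered; app shell available offline."))
      .catch((err) => console.error("Service worker registration failed:", err));
  });
}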

Choosing a model

For on-device processing, we opted for a model that is integrated directly into the web browser. This approach has some key advantages. Instead of shipping and loading a large model ourselves, we simply need access to the model the browser has already downloaded. This is where the Firebase AI Logic SDKs come in: they detect whether a compatible on-device model like Gemini Nano is available and, if not, ask the browser to download and cache the model for future use. The nice thing is that once the browser has downloaded the on-device model, every web app that wants to use it has access to it.
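Under the hood, the on-device path builds on the browser's built-in Prompt API (the same surface our availability check touches via chromeAdapter.languageModelProvider later in this post). Purely as a sketch of what that browser-level check looks like, assuming Chrome's LanguageModel global (the exact names have shifted across Chrome versions):

// Sketch of the browser-level availability check and download trigger.
// Assumes Chrome's Prompt API exposes a LanguageModel global; this is not our app code.
const availability = await LanguageModel.availability();

if (availability === "available") {
  console.log("Gemini Nano is already downloaded and ready.");
} else if (availability === "downloadable" || availability === "downloading") {
  // Creating a session asks the browser to download and cache the model for all web apps.
  await LanguageModel.create({
    monitor(m) {
      m.addEventListener("downloadprogress", (e) => {
        console.log(`Model download: ${Math.round(e.loaded * 100)}%`);
      });
    },
  });
} else {
  console.log("Built-in model unavailable in this browser; use the cloud fallback.");
}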

Reliable fallback for other browsers

Of course, not all browsers or devices have a built-in AI model. For users on other platforms, we needed a reliable fallback. This is where hybrid AI inference comes in: we use an on-device model when it's available and fall back to a cloud model when it isn't.

If an on-device model isn't available, the SDKs can switch to a cloud-based model, in our case gemini-2.5-flash-lite. Though many models would have worked, we found that this one met all of our requirements:

  • Image input
  • Structured output
  • High availability

Flash-Lite provided good results for both metadata generation and sorting based on the structured outputs, with the added benefit of being cost-effective.

However, we wanted to make this switch transparent to the user. So before sending any data, we show a pop-up modal asking for permission to process their photos in the cloud.

Here is a quick code snippet showing how we did this in our React app:

App.tsx
useEffect(() => {
  const checkNanoAvailability = async () => {
    // Check the availability of the on-device model
    const languageModelProvider = await metadataModel.chromeAdapter.languageModelProvider;
    if (!languageModelProvider || (await languageModelProvider.availability()) !== "available") {
      console.warn("Gemini Nano is not available. Falling back to cloud model.");
      setShowNanoAlert(true);
    } else {
      console.log("Gemini Nano is available and ready.");
    }
  };
  checkNanoAvailability();
}, []);

// Inside the component's JSX:
<>
  {showNanoAlert && (
    <div className="modal-overlay">
      <div className="modal-content">
        {/* Omitted for brevity */}
      </div>
    </div>
  )}
</>

The AI portion

When building AI apps, how you ask the model to perform a task is just as important as the code itself. Here is how we approached it.

Multimodal input

For the app to organize our photos, it had to be able to see them. The metadata generation prompt is multimodal, meaning it combines an image (or other media) with text. The text portion also lets the user provide specific instructions to guide the analysis, like “focus on the architecture” or “identify the seasons”, to get more relevant and useful categorization.

To instantiate a model that accepts multimodal inputs, we set createOptions in our getGenerativeModel call:

App.tsx
export const metadataModel = getGenerativeModel(ai, {
  mode: InferenceMode.PREFER_ON_DEVICE,
  inCloudParams: {
    model: "gemini-2.5-flash-lite",
    generationConfig: { responseMimeType: "application/json", responseSchema: metadataSchema },
  },
  onDeviceParams: {
    promptOptions: { responseConstraint: metadataSchema },
    createOptions: { expectedInputs: [{ type: "image" }, { type: "text" }] },
  },
});

Structured output

To use the AI’s output in our application, we need it to be clean and predictable. We enforce a strict JSON schema that the model’s response must follow. This ensures we always get back a parsable object with a description, an array of categories, and the dominant colors, eliminating the need for complex string parsing and increasing the accuracy of the response.
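For example, the parsed response for a single photo might look like this (illustrative values, not real output):

{
  "description": "A golden retriever running along a sandy beach at sunset.",
  "categories": ["dog", "beach", "sunset", "pets", "outdoors"],
  "dominant_colors": ["#E8A33D", "#2E6F95", "#C9B79C"]
}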

Here is our schema that we used for structuring the output:

App.tsx
import { Schema } from "firebase/ai";

const metadataSchema = Schema.object({
  properties: {
    description: Schema.string({ description: "A concise, one-sentence description of the image." }),
    categories: Schema.array({
      description: "An array of 4-7 relevant keywords.",
      items: Schema.string(),
    }),
    dominant_colors: Schema.array({
      description: "An array of the top 3 dominant color hex codes in the image.",
      items: Schema.string(),
    }),
  },
  required: ["description", "categories", "dominant_colors"],
});

const sortingSchema = Schema.object({
  properties: {
    sorted_groups: Schema.array({
      description: "An array of groups, where each group contains images belonging to that category.",
      items: Schema.object({
        properties: {
          group_name: Schema.string({ description: "The name of the category or group." }),
          images: Schema.array({ items: metadataSchema }),
        },
        required: ["group_name", "images"],
      }),
    }),
  },
  required: ["sorted_groups"],
});

Prompts

Our app has two main tasks, so we created two distinct prompts:

Metadata generation

This prompt runs sequentially for each individual image. It takes the image and an optional user query, and returns a JSON object with a description, an array of categories, and dominant color hex codes.

Sorting

After all images have been analyzed, this second prompt takes the entire collection of generated metadata and a user’s sorting preference (e.g., “by event,” “by color”). It then returns a single JSON object with the images organized into logical groups.

This separation makes the process more efficient and the user experience more responsive, as metadata for each image appears as soon as it’s ready.

The code

Here’s how we put everything together using the Firebase AI Logic SDK.

Setting up the models

First, we initialize the Firebase AI SDK. We then configure two different model instances based on our two-prompt strategy. Notice the mode: InferenceMode.PREFER_ON_DEVICE setting: it tells the SDK to handle the on-device-first logic with a cloud fallback, enabling the hybrid approach. We define both the cloud model (inCloudParams) and the on-device configuration (onDeviceParams), including the required JSON schemas for structured output.

App.tsx
import { initializeApp } from "firebase/app";
import { getAI, getGenerativeModel, GoogleAIBackend, InferenceMode } from "firebase/ai";

// firebaseConfig comes from your Firebase project settings
const firebaseApp = initializeApp(firebaseConfig);
const ai = getAI(firebaseApp, { backend: new GoogleAIBackend() });

// Model for generating metadata for a single image
export const metadataModel = getGenerativeModel(ai, {
  mode: InferenceMode.PREFER_ON_DEVICE,
  inCloudParams: {
    model: "gemini-2.5-flash-lite",
    generationConfig: { responseMimeType: "application/json", responseSchema: metadataSchema },
  },
  onDeviceParams: {
    promptOptions: { responseConstraint: metadataSchema },
    createOptions: { expectedInputs: [{ type: "image" }, { type: "text" }] },
  },
});

// Model for sorting a collection of image metadata
const sortingModel = getGenerativeModel(ai, {
  mode: InferenceMode.PREFER_ON_DEVICE,
  inCloudParams: {
    model: "gemini-2.5-flash-lite",
    generationConfig: { responseMimeType: "application/json", responseSchema: sortingSchema },
  },
  onDeviceParams: { promptOptions: { responseConstraint: sortingSchema } },
});

Streaming the metadata

When a user analyzes photos, we want them to see results as soon as possible. Although it isn't a true server stream, we created a stream-like experience on the client. We loop through the images and kick off the analysis one at a time. Using Promise.all would wait for every image to finish, but by processing them individually in a loop, React can update its state as soon as each image's metadata comes back.

App.tsx
const generateImageMetadata = async (images, userInput = '') => {
  try {
    const metadataPromises = images.map(async (imageFile) => {
      const imagePart = await fileToGenerativePart(imageFile);
      let prompt = `Analyze this image and generate the following metadata in JSON format: a concise 'description', an array of 4-7 'categories', and the top 3 'dominant_colors' as hex codes.`;
      if (userInput) {
        prompt += ` Focus the analysis on: "${userInput}".`;
      }
      const result = await metadataModel.generateContent([prompt, imagePart]);
      return JSON.parse(result.response.text());
    });
    return await Promise.all(metadataPromises);
  } catch (error) {
    console.error("Error generating image metadata:", error);
    return new Array(images.length).fill(null);
  }
};

const handleAnalyze = async () => {
  // ... setup code ...
  setIsAnalyzing(true);
  setStatus(`Analyzing ${imageData.length} images...`);

  for (const [index, data] of imageData.entries()) {
    try {
      // Call the AI logic for a single image
      const [metadata] = await generateImageMetadata([data.file], userInput);

      // Update the state for this specific image, triggering a re-render
      setImageData(prev => prev.map((item, itemIndex) =>
        itemIndex === index
          ? { ...item, metadata: metadata, isLoading: false, error: !metadata }
          : item
      ));
    } catch (error) {
      console.error(`Error processing image ${index}:`, error);
    }
  }

  setIsAnalyzing(false);
  setStatus('Analysis complete! Ready to sort.');
};
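One helper in the snippet above, fileToGenerativePart, isn't shown. A minimal sketch of it (assuming the standard inlineData part shape the SDK accepts) reads the File as a data URL and keeps only the base64 payload:

// Illustrative sketch of the helper used above: converts a File into an inline-data image part.
const fileToGenerativePart = async (file) => {
  const base64 = await new Promise((resolve, reject) => {
    const reader = new FileReader();
    // reader.result is a data URL ("data:image/jpeg;base64,..."); keep only the base64 part.
    reader.onload = () => resolve(reader.result.split(",")[1]);
    reader.onerror = reject;
    reader.readAsDataURL(file);
  });
  return { inlineData: { data: base64, mimeType: file.type } };
};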

Handling the sorting

After the initial analysis, we gather all the generated JSON objects into an array. This collection is passed to our sortAndCategorizeImages helper, which sends it to the second model configuration in a single request. The prompt instructs the model to act as a photo organizer and group the images based on the user's chosen method.

App.tsx
const handleSort = async (event) => {
  const sortBy = event.target.dataset.sortby;
  const allMetadata = imageData.map(data => data.metadata).filter(Boolean);

  if (allMetadata.length === 0) {
    setStatus('No metadata available to sort. Please analyze images first.');
    return;
  }

  setIsAnalyzing(true);
  setStatus(`Sorting by ${sortBy}...`);
  const result = await sortAndCategorizeImages(allMetadata, sortBy);
  setSortedData(result);

  setStatus(`Images sorted by ${sortBy}`);
  setIsAnalyzing(false);
};

export const sortAndCategorizeImages = async (imageMetadataArray, sortBy) => {
  if (!imageMetadataArray || imageMetadataArray.length === 0) return { sorted_groups: [] };
  try {
    const prompt = `You are a photo gallery organizer. Based on the following image metadata, group the images according to the user's preference to sort by ${sortBy}. Image Metadata: ${JSON.stringify(imageMetadataArray, null, 2)}. Return a single JSON object categorizing all images into logical groups. Ensure every image is placed into exactly one group.`;

    const result = await sortingModel.generateContent(prompt);
    return JSON.parse(result.response.text());
  } catch (error) {
    console.error("Error sorting images:", error);
    return { sorted_groups: [] };
  }
};

Making tags portable with EXIF/XMP metadata

AI-generated tags are only useful if the user can take them somewhere. To give users access to those tags, we provide an export option that saves the tags into their image. It uses the piexifjs library to embed the metadata as a JSON object in the UserComment field of the image's EXIF data. Users can then import their photos into other desktop tools and use the AI-generated tags for further photo refinement.

App.tsx
const handleExportImage = async (index) => {
  // Omitting for brevity
  const reader = new FileReader();
  reader.onload = async (e) => {
    const imageDataUrl = e.target.result;
    const exifObj = {
      "Exif": {
        [piexif.ExifIFD.UserComment]: JSON.stringify(metadata)
      }
    };
    const exifBytes = piexif.dump(exifObj);
    const newImageDataUrl = piexif.insert(exifBytes, imageDataUrl);
    // Omitting for brevity
  };
};
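Because the tags ride along inside the EXIF data, any tool that can read the UserComment field can recover them. As an illustrative check with piexifjs itself (a sketch, not part of our app), reading the embedded JSON back out of an exported image's data URL could look like this:

// Sketch: read the embedded JSON tags back out of an exported image's data URL.
const readEmbeddedMetadata = (imageDataUrl) => {
  const exifObj = piexif.load(imageDataUrl);
  const userComment = exifObj["Exif"][piexif.ExifIFD.UserComment];
  return userComment ? JSON.parse(userComment) : null;
};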

Conclusion

Building an AI app shouldn't mean users need to sacrifice their privacy. With Firebase AI Logic, we can enable on-device processing for privacy, with a transparent cloud fallback for compatibility. Come tell us how you are planning to use Gemini Nano in your application on X and LinkedIn!