Build a Camera for Thought – Episode 2

Building a Camera to Capture Imagination: Early Passion and Failure

Two years ago, I embarked on a journey to build a start-up based on the belief that camera software should be able to capture people’s thoughts, not just reality. In my previous post (nearly eighteen months ago), I argued that the term “camera app” was misleading and that artificial intelligence (AI) and augmented reality (AR) were poised to disrupt the capturing process. Today, I want to share the tale of our journey over the past two years, which saw us launch more than six apps and test various product hypotheses and technology frameworks. This led us to where we are today—Realm. I will also tell you why I consider generative AI to be the right technology for the thought camera.

Initially, I was confident that short, vertical video content created for mobile devices would be the only format that mattered for new camera products. I had believed this since my time at Google (in 2011), where we discovered how YouTube TrueView ads (the ads at the start of a video that viewers can skip after five seconds) significantly improved both ad quality and user experience. Witnessing the magic of short video ads convinced me to seek out great products that could generalize this experience to all videos. I later joined Snapchat and was part of several exciting product launches, from Stories and Discover to Search and Spotlight. Meanwhile, I watched as ByteDance’s Douyin, and then TikTok, spread across China and the rest of the world thanks to their commitment to AI and video technology. All these experiences strengthened my conviction that AI and short videos would lead the way in consumption formats. However, this conviction was somewhat biased, and it caused me to overlook crucial technical progress at the beginning of our start-up journey, as I’ll explain later.

We started by building a simple app, PetThoughts, with the idea of creating a foundation for our AI technology through a fun application. The concept was to transform any pet video into an AI-generated video that reflected the owner’s perspective. Our reasoning was that most pet owners see their pets as human friends with individual voices. At the time, there was a popular TikTok filter that demonstrated this idea by using an auto-generated speech bubble for pets.

However, what appeared to be a simple idea turned out to be a significant challenge. We scrambled to pack as many concepts from our vision as possible into the product while building an initial developer team. My first mistake as a technical founder was failing to clarify the problem, iterate on the go-to-market message, and focus on our bet. We applied the existing smart-camera framework to the problem, assuming that people needed the same level of control to create an imagined video as they did for their stories or TikTok accounts. Therefore, we built numerous video-making primitives in addition to many machine-learning components. We did an overwhelming amount of technical work and ended up with an overly complicated user experience.

In retrospect, I think that we overcomplicated the app by adding too many features, which detracted from the core concept of capturing people’s thoughts. In the end, we learned that it was essential to focus on the primary value proposition of the product and iterate on the message accordingly.

Speeding Up Experiments to Explore Independent Hypotheses

The first year of start-up experience taught us the importance of having a clear focus and iterating on the core concept of the product to build a more streamlined and user-centric application. As a result, we developed a series of parallel experiments. Each one was an iOS app and concentrated on a single hypothesis that leveraged a specific technology to help people create videos they couldn’t easily make otherwise.

Our first experiment, Bubble Camera v1, was a collaborative video editor that enabled users to cocreate videos with others by editing text, audio, and overlays. It worked like a GitHub for video editing, where each commit could be branched off in a different direction. The idea was to empower collective creativity and help people with limited video editing skills. Although some beta testers enjoyed the tool, we found that it did not trigger frequent use as a self-expression camera, which is what we had envisioned.

So, we developed Bubble Camera v2, which involved cocreating with an AI bot. We introduced the concept of non-playable characters (NPCs) that could automatically collaborate with users to dub videos, auto-caption them, and generate different styles, among other things. NPCs allowed us to experiment more easily with AI models and various transformations. However, we found that the added value from well-known models could not outpace popular video editing apps or trigger frequent use among early testers.

Moving forward, we explored two different directions. The first involved finding an application that people would use on a daily basis to cocreate with others. The second entailed building a novel AI-powered NPC that could produce magical results, as we had initially envisioned for PetThoughts. Two ideas that we explored further were Huddle, a video community app that supported asynchronous video threads for communities, and MetaDance, an app that leveraged an NPC to turn any TikTok dance video into a generated 3D video based on the user’s avatar. Although Huddle triggered a lot of interest from communities, users soon found that not knowing when they would get a video response diminished their initial enthusiasm. Similarly, MetaDance faced challenges because it was hard to automatically compose varied scenes and achieve good lighting in Blender. (Technically speaking, we are still miles away from automatically generating specified 3D scenes.)

During our early exploration of various ideas for NPCs, we came across Midjourney and its advancements in text-to-image and diffusion models. However, we initially dismissed it because of the difficulty of controlling the generation process and the fact that it could produce only images. My personal bias toward short videos also made me think that the technology was not yet advanced enough for a serious consumer product. I was focused on the existing camera framework, where users have complete control over the creation process, but I still envisioned a more magical camera app. Looking back, I realize that I was limited by my own assumptions.

However, our perception quickly changed when we noticed the rising popularity of Midjourney during the summer of 2022. Despite the low quality of the generated art and its limited output (only images), people still enjoyed the process of creating something that aligned with their imagination. The generative technique allowed people to further iterate on their ideas and brought their imaginations to life. Therefore, it dawned on us that this technique could be the key to building the thought camera we had been searching for.

Generative AI: A New Foundation for Building the Thought Camera

As a result of these events, we swiftly launched an app called MetaReal, which allowed users to create AI-generated avatars using text input. The idea behind this app was to provide users with a way to create imagined selfies, just like they do with their smartphone cameras. By simply typing a description—for example, “Show me as Spider-Man”—users could generate an image of themselves in the superhero’s costume. The more detailed their text input, the better the quality of the resulting image.

The text-to-image technique used in MetaReal is part of the generative AI trend popularized by OpenAI. Generative AI techniques are used to create digital content, including images, videos, music, text, and even entire websites. These techniques can produce artistic images that imitate a particular artist’s style, somewhat like the early Instagram filters did. To let users add their own faces to a foundational text-to-image model, we used the DreamBooth fine-tuning technique published by Google Research, which personalizes the model on a handful of the user’s photos so it can generate images that include them.
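
To make the personalization step concrete, here is a minimal sketch of what generating an “imagined selfie” from a DreamBooth-personalized model can look like, using the open-source Hugging Face diffusers library. The checkpoint path, the rare identifier token “sks,” and the prompt are illustrative assumptions rather than our production setup; the fine-tuning step itself pairs a few photos of the subject with prompts containing that identifier, as described in the DreamBooth paper.

```python
# Minimal sketch: generating an "imagined selfie" from a DreamBooth-personalized
# Stable Diffusion checkpoint with the Hugging Face diffusers library.
# Assumption: a checkpoint fine-tuned on a handful of the user's photos has been
# saved to ./dreambooth-user-checkpoint and bound to the rare identifier "sks".
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "./dreambooth-user-checkpoint",  # hypothetical path to the personalized model
    torch_dtype=torch.float16,
).to("cuda")

# "sks person" stands in for the user the model was fine-tuned on, so the rest
# of the prompt reads like the text a MetaReal user would type.
prompt = "a photo of sks person as a superhero in a red and blue costume, cinematic lighting"
image = pipe(prompt, num_inference_steps=30, guidance_scale=7.5).images[0]
image.save("imagined_selfie.png")
```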

We soon faced competition from Lensa, which offered a simple pay-for-generated-avatar model and used guerrilla marketing tactics to great effect. Their viral success resulted in many copycat apps for creating AI-generated avatars, which flooded the app store. So, we realized that MetaReal, as it was, would not be sufficient to succeed in the market because the trend was too ephemeral. However, during beta testing, we noticed that users were displaying some of the behavior we had been searching for.

MetaReal offers a way for users to imagine themselves in ways that are vastly different from their reality. They use the app to portray themselves as characters in anime stories or movies, as younger or older versions of themselves, or with tattoos that they wouldn’t try in real life. What’s important is not only the end result but also the process of creation, which allows people to write down their ideas and see them come to life in seconds. It turns out that text is a much easier format for people to express their thoughts, while visual formats are better suited for consumption. For example:

  • Writing is a common and familiar method of creation that people use in chats, emails, social media posts, blogs, and more. It is the most common way to record one’s thoughts, ideas, or feelings.
  • Text allows people to express their thoughts in abstract ways, including describing feelings with words such as “exciting,” “happy,” or “scary” and defining styles such as modern, classic, or pixiv. By using text, people can make vague references to things they have seen before, such as a movie, a show, or a trend. Text is a quick way to express thoughts that develop rapidly, and it is easily edited.
  • Certain words, known as modifiers, can be especially potent when used in this context. Artistic movements such as the Renaissance, Gothic, Fauvism, and Pointillism can be used to describe a particular style, while terms such as “minimalism,” “cyberpunk,” and “holography” can define a specific aesthetic. Words such as “canvas,” “splash,” and “sand” can specify the medium. (A short prompt sketch illustrating how modifiers compose follows this list.)
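
To make the role of modifiers concrete, here is a tiny, hypothetical sketch of how a subject and a few modifiers combine into a prompt string; the helper function and the word lists are illustrative examples, not part of MetaReal.

```python
# Illustrative sketch: composing a text-to-image prompt from a subject plus modifiers.
# The helper and the modifier lists are hypothetical examples, not product code.

def compose_prompt(subject: str, modifiers: list[str]) -> str:
    """Join a subject with comma-separated modifiers, a common prompting convention."""
    return ", ".join([subject, *modifiers])

style_modifiers = ["pointillism", "cyberpunk"]      # artistic movement / aesthetic
medium_modifiers = ["oil on canvas", "ink splash"]  # medium

print(compose_prompt("a portrait of a young explorer", style_modifiers + medium_modifiers))
# -> a portrait of a young explorer, pointillism, cyberpunk, oil on canvas, ink splash
```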

As mentioned above, modifiers are words that have a strong influence on the output of AI generations. However, not all modifiers are available in the base version of a model. To add new modifiers to the system, researchers use techniques such as textual inversion or LoRA (low-rank adaptation). Textual inversion learns a new token embedding from a handful of example images of the desired concept, while LoRA trains small adapter weights on top of the base model; in both cases, the model learns to recognize the new modifier and apply it when generating images. These engineered modifiers allow for a greater degree of control and specificity in the outputs of generative AI, enabling the model to produce a wider range of images tailored to the user’s needs and preferences.
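
As a rough sketch of how such engineered modifiers get plugged in at inference time, the diffusers library exposes loaders for textual inversion embeddings and LoRA weights. The base model ID, local file paths, and placeholder token below are assumptions for illustration, not our production setup.

```python
# Sketch: extending a base text-to-image model with engineered modifiers.
# A textual inversion embedding adds a new pseudo-token to the text encoder;
# LoRA weights adapt the model toward a new style. Paths and IDs are hypothetical.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Bind a learned embedding for a new modifier to a placeholder token.
pipe.load_textual_inversion("./embeddings/holography.bin", token="<holography>")

# Load LoRA adapter weights that nudge generations toward a specific aesthetic.
pipe.load_lora_weights("./lora/retro-poster-style")

prompt = "a city skyline at dusk, <holography>, retro poster style"
image = pipe(prompt, num_inference_steps=30).images[0]
image.save("styled_skyline.png")
```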

Eighteen months ago, in my previous post, I wrote that “the mobile camera app is a misleading name for what it actually achieves.” I also argued that “the mobile camera app in our pocket should try to facilitate that thinking process in our mind, capturing things that only we can see in a powerful and easy way.” It turns out that these assumptions were limited by my past experiences. A camera for thought doesn’t need to depend on traditional optical instruments; it can capture people’s imagination by simply using words.

MetaReal Is Now Realm

This discovery allowed us to see the future of storytelling and community. After launching MetaReal in early February, we began ideating on a brand that represented our vision: your key to unlocking brilliant, new creative worlds. Our team is beyond excited to officially introduce Realm to the world. Realm is the social platform where your words come to life. Our mobile app is your portal to meeting fellow creatives, learning generative AI, and sharing beautiful visual stories.

Video cameras have let us discover a world of beautiful things and interesting people. Thought cameras will show us an even bigger world of amazing imaginations and fascinating people.

There is a whole new world waiting to be discovered.