A Pragmatic Comparative Analysis of OpenAI’s GPT-4 (USA), Aleph Alpha’s Luminous-Supreme-Control (Germany), and the Locally Runnable, Open-Source Model Llama 2 by Facebook
TL;DR: In a brief comparison, we assessed the performance of three language models: GPT-4 from OpenAI (USA), Aleph Alpha’s Luminous-Supreme-Control (Germany), and the locally runnable, open-source model Llama 2 by Facebook. In our evaluation, GPT-4 emerged as the frontrunner, Llama 2 offers a compelling option for self-controlled use, and Aleph Alpha’s model falls short in comparison.
At AGENTS.inc, we leverage the capabilities of AI to develop AI Agents that streamline essential business functions. These agents connect internal and external data sources, employing knowledge graphs and AI models to perform advanced analytics. Tasks like identifying companies, monitoring competitors, and analyzing patents and scientific documents are the bread and butter of our AI Agents’ operations.
In this context, the versatility of large language models (LLMs) in the ever-evolving AI landscape is undeniable. Models such as GPT-4 have shown remarkable performance and have the potential to transform various industries by offering versatile, out-of-the-box solutions, and they have taken center stage in this AI-driven transformation.
One burning question often lingers in the minds of our customers and the broader AI community: Where does the data go, and how is it processed? This question is especially crucial given the growing concerns about data privacy and security. While we employ custom AI models that ensure data security and privacy, the role of LLMs in addressing these concerns is of growing interest. The geographic location and ownership of LLM servers impact data privacy and compliance with regional and corporate data protection guidelines. GDPR, for instance, imposes strict regulations on data usage, storage, and transfer, and IT security standards and compliance measures protect organizations from sharing data with third parties.
In this article, we aim to provide a comparison of GPT-4, Luminous-Supreme-Control by Aleph Alpha, Germany’s most prominent AI startup, and Facebook’s Llama 2 70B, one of the first open-source models of this size. Our focus is to explore their versatility and performance in a variety of practical use cases. However, it’s important to note that this is not a scientifically controlled test. Rather, it’s a hands-on exploration of how these models fare in our daily use cases. The goal is to provide valuable insights to our audience and help them assess how these models could potentially benefit their specific use cases. We’ve carefully selected these three notable contenders to represent different facets of the industry:
- First, we have OpenAI’s GPT-4, widely considered the industry default and a benchmark for AI language models. GPT-4 is a powerful model with remarkable capabilities. However, it’s primarily available as a service within the United States (or as a service by Microsoft tailored for large corporations).
- Next, we’re shining a spotlight on Aleph Alpha’s Luminous-Supreme-Control (LSC). As a German startup, Aleph Alpha works in this competitive field, showcasing the innovative drive emanating from Europe. By including their model, we underscore the emerging alternatives challenging the established U.S. AI giants.
- Finally, we feature Facebook’s Llama 2 70B, an open-source model that, given the requisite computing infrastructure, can be run locally. Including Llama 2 exemplifies the potential for decentralized AI solutions, where users have the autonomy to harness these powerful models without relying on external services.
In the following sections, we delve into the details of our comparison, shedding light on the strengths and limitations of these powerful AI models. The ultimate aim is to aid businesses, researchers, and individuals in choosing the right model for their unique requirements and ensure that data privacy and security remain at the forefront of the AI revolution.
So, let’s get started.
1. Analyzing Web Data
A common and essential task in the AI landscape involves the extraction and structuring of data from websites. This process often includes sorting information such as names, email addresses, company details, street addresses, zip codes, cities, countries, and phone numbers into a structured format like JSON objects. This structured data can be used for various purposes, from contact information collection to market analysis.
To shed light on the capabilities of the AI models we’re exploring, we conducted a practical test using the information from a typical website’s imprint. The imprint we used for this experiment is our own:
IMPRINT
AGENTS HQ GmbH
Otto-Suhr-Alle 18/20
10585 Berlin
Phone: +49 30 98 33 84 66
Email:
CEO: Dr. Tatjana Samsonowa-Denef
Local Court: Charlottenburg
HRB: 159659
For the purpose of this experiment, we omitted the country to evaluate the AI models’ ability to recognize this missing piece of information and, more importantly, structure the available data into a desired format. We also aimed to see how well the AI models could handle the task of recognizing specific data elements.
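Once a model replies, the JSON still has to be parsed and checked against the expected schema before it can feed our downstream analytics. A minimal sketch in Python (the `validate_extraction` helper and its key list are our own illustration, not part of any model’s API):

```python
import json

# Keys we expect in the model's reply, mirroring the extraction prompt.
EXPECTED_KEYS = ["people", "emails", "company", "street",
                 "zip", "city", "country", "phones"]

def validate_extraction(raw_reply: str) -> dict:
    """Parse a model reply and ensure every expected key maps to a list."""
    data = json.loads(raw_reply)
    result = {}
    for key in EXPECTED_KEYS:
        value = data.get(key, [])
        # Normalize single values to one-element lists.
        result[key] = value if isinstance(value, list) else [value]
    return result

reply = '{"people": ["Dr. Tatjana Samsonowa-Denef"], "zip": ["10585"], "city": ["Berlin"]}'
parsed = validate_extraction(reply)
```

Normalizing missing keys to empty lists keeps the pipeline robust even when a model drops a field, as happened in some of our runs.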
Results Overview
- GPT-4 (OpenAI): As expected, GPT-4 managed the task effectively, providing structured data in the desired format. However, as we explore other models, you’ll notice that this model sets a high benchmark in terms of data extraction.
- LSC (Aleph Alpha): LSC, on the other hand, presented an interesting dynamic. In our initial run, it did not identify the “people” section, but it correctly provided “Germany” as the country. As a German startup ourselves, we gave it another chance and experimented with a different model version (luminous-supreme), which eventually led to the recognition of people. This model, however, performed far worse in other use cases.
- LLAMA2 (Facebook): In our test, LLAMA2 exhibited a tendency to add some additional, interesting data, a pattern that will become more evident in further analyses. It is important to note that while it offers extended information, this can sometimes lead to less focused results that require post-processing.
extract or derive all people, emails, company name, street, zip code, cities, countries, phones in a JSON object in the format {people:[],emails:[], company:[], street:[], zip:[], city:[], country:[], phones:[]}
IMPRINT
AGENTS HQ GmbH
Otto-Suhr-Alle 18/20
10585 Berlin
Phone: +49 30 98 33 84 66
Email:
CEO: Dr. Tatjana Samsonowa-Denef
Local Court: Charlottenburg
HRB: 159659
{
  "people": ["Dr. Tatjana Samsonowa-Denef"],
  "emails": [""],
  "company": ["AGENTS HQ GmbH"],
  "street": ["Otto-Suhr-Alle 18/20"],
  "zip": ["10585"],
  "city": ["Berlin"],
  "country": ["Germany"],
  "phones": ["+49 30 98 33 84 66"]
}
{"people": [], "emails": [], "company": ["IMPRINT"], "street": ["Otto-Suhr-Alle 18/20"], "zip": ["10585"], "city": ["Berlin"], "country": ["Germany"], "phones": []}
Sure, here’s an example of how you could extract the information you need from the JSON object:
{
  "people": [
    {
      "name": "Dr. Tatjana Samsonowa-Denef",
      "title": "CEO"
    }
  ],
  "emails": [
    {
      "email": ""
    }
  ],
  "company": {
    "name": "AGENTS HQ GmbH",
    "address": {
      "street": "Otto-Suhr-Alle 18/20",
      "zip": "10585",
      "city": "Berlin",
      "country": "Germany"
    },
    "phones": [
      {
        "phone": "+49 30 98 33 84 66"
      }
    ]
  }
}
Here’s a breakdown of the extracted information:
People:
Dr. Tatjana Samsonowa-Denef (CEO)
Emails:
Company:
Name: AGENTS HQ GmbH
Address:
Street: Otto-Suhr-Alle 18/20
Zip: 10585
City: Berlin
Country: Germany
Phones:
+49 30 98 33 84 66
I hope this helps! Let me know if you have any questions.
2. Summarization and Re-writing
In the realm of AI-driven content creation, summarization and re-writing are essential tasks that often serve as the backbone of AI Agent capabilities. These tasks can range from generating concise abstracts for scientific publications to summarizing lengthy articles into more digestible pieces.
In this comparative test, we evaluated how well three AI models could handle these tasks. To do so, we employed a two-fold approach: We used a popular scientific publication in the sphere of AI (Liu, Zhuang, et al. “A convnet for the 2020s.” Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2022.) as our reference and tasked the AI models with generating a two-sentence summary. For the second part of our test, we selected a historic social science article (Rittel, Horst WJ, and Melvin M. Webber. “Dilemmas in a general theory of planning.” Policy sciences 4.2 (1973): 155-169.) and sought to create a one-sentence summary suitable for use in a state-of-the-art report.
Results Overview
- GPT-4 (OpenAI): GPT-4 performed impressively, accurately generating succinct summaries in both scenarios. Its ability to comprehend and rephrase complex content highlights its potential in this domain.
- LSC (Aleph Alpha): LSC’s performance in these tasks raised some concerns. The model did not comply with the sentence limits set by the prompt. Additionally, it struggled to manage the perspective change we introduced and continued to use “we” in the first case.
- LLAMA 2 (Facebook): LLAMA 2 demonstrated robust summarization capabilities, providing concise summaries in both cases. Its proficiency in producing coherent and contextually relevant summaries is evident.
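Because one of the models ignored the sentence limit, we find it useful to verify compliance programmatically before accepting a summary. A rough sketch (the naive splitter below miscounts abbreviations such as “e.g.”; it is an illustration, not a full sentence tokenizer):

```python
import re

def within_sentence_limit(summary: str, limit: int) -> bool:
    """Roughly count sentences by terminal punctuation and compare to the limit."""
    # Split on '.', '!' or '?' followed by whitespace or end of string.
    sentences = [s for s in re.split(r"[.!?](?:\s+|$)", summary.strip()) if s]
    return len(sentences) <= limit

two_sentences = "ConvNeXt modernizes ResNets. It matches Transformers on ImageNet."
three_sentences = "First claim. Second claim. Third claim."
ok = within_sentence_limit(two_sentences, 2)
too_long = within_sentence_limit(three_sentences, 2)
```

A failed check can trigger a retry with a firmer prompt instead of shipping a non-compliant summary.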
Summarize the following articles in two sentences to be used in a State-of-the-Art report.
The “Roaring 20s” of visual recognition began with the introduction of Vision Transformers (ViTs), which quickly superseded ConvNets as the state-of-the-art image classification model. A vanilla ViT, on the other hand, faces difficulties when applied to general computer vision tasks such as object detection and semantic segmentation. It is the hierarchical Transformers (e.g., Swin Transformers) that reintroduced several ConvNet priors, making Transformers practically viable as a generic vision backbone and demonstrating remarkable performance on a wide variety of vision tasks. However, the effectiveness of such hybrid approaches is still largely credited to the intrinsic superiority of Transformers, rather than the inherent inductive biases of convolutions. In this work, we reexamine the design spaces and test the limits of what a pure ConvNet can achieve. We gradually “modernize” a standard ResNet toward the design of a vision Transformer, and discover several key components that contribute to the performance difference along the way. The outcome of this exploration is a family of pure ConvNet models dubbed ConvNeXt. Constructed entirely from standard ConvNet modules, ConvNeXts compete favorably with Transformers in terms of accuracy and scalability, achieving 87.8% ImageNet top-1 accuracy and outperforming Swin Transformers on COCO detection and ADE20K segmentation, while maintaining the simplicity and efficiency of standard ConvNets.
The research revisits the conventional ConvNets or convolutional neural networks design, enhancing them to compete effectively with the currently superior Vision Transformers (ViTs). The researchers established a new model called ConvNeXt, which is entirely based on ConvNet modules and performs competitively in terms of accuracy and scalability, surpassing ViTs on specific tasks like COCO detection and ADE20K segmentation whilst preserving the simplicity and efficiency of standard ConvNets.
The “Roaring 20s” of computer vision was marked by the introduction of ViTs, which quickly surpassed ConvNets as the state-of-the-art image classification model. However, pure ViTs are not well-suited for general computer vision tasks such as object detection and semantic segmentation. In this work, we explore the design spaces of pure ConvNets and discover several key components that contribute to their performance. The outcome is a family of ConvNet models, dubbed ConvNeXt, that achieve state-of-the-art performance on a variety of vision tasks.
The article discusses the resurgence of convolutional neural networks (ConvNets) as a viable alternative to vision transformers (ViTs) for image classification tasks, and presents a family of pure ConvNet models called ConvNeXt that competes favorably with Transformers in terms of accuracy and scalability. The authors argue that the effectiveness of hybrid approaches that combine ConvNets and Transformers is due to the intrinsic superiority of Transformers, rather than the inherent inductive biases of convolutions.
Summarize the following abstract in one sentence to be used in a State-of-the-Art report.
The search for scientific bases for confronting problems of social policy is bound to fail, because of the nature of these problems. They are “wicked” problems, whereas science has developed to deal with “tame” problems. Policy problems cannot be definitively described. Moreover, in a pluralistic society there is nothing like the undisputable public good; there is no objective definition of equity; policies that respond to social problems cannot be meaningfully correct or false; and it makes no sense to talk about “optimal solutions” to social problems unless severe qualifications are imposed first. Even worse, there are no “solutions” in the sense of definitive and objective answers.
The search for scientific bases for confronting problems of social policy is unlikely to succeed due to the inherent nature of these problems, which are “wicked” and cannot be definitively described, and the lack of an objective definition of equity and optimal solutions in a pluralistic society.
The author of this abstract is arguing that social policy problems are “wicked” problems, and that science is not the appropriate tool to deal with them. The author is also arguing that there is no objective definition of equity, and that there is no objective definition of “optimal solutions” to social problems.
The abstract argues that the use of scientific bases to address social policy problems is futile due to the complex, “wicked” nature of these issues, which cannot be definitively described or solved, particularly in a pluralistic society where concepts like public good and equity lack objective definitions.
3. Company Data Comparison
Comparing companies plays a vital role in various AI Agent tasks, from identifying potential suppliers to pinpointing M&A targets. The AI model’s ability to process uncleaned, mixed-language text from websites and generate accurate company descriptions is paramount for automating this process.
In our comparison, we provided the AI models with unedited text from websites related to companies in the metal processing industry. We also included text from a yellow pages source, which introduced our AI Agents’ common challenge of irrelevant results.
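For context, a crude keyword baseline already separates the manufacturers from the directory entry in this sample; the term lists below are invented for illustration and are no substitute for the abstraction an LLM provides:

```python
# Hypothetical keyword lists for a crude baseline grouping.
MANUFACTURING_TERMS = ["blechbearbeitung", "maschinenbau", "cnc", "präzisionstechnik"]
DIRECTORY_TERMS = ["branchenbuch", "gelbe seiten", "telefonnummern"]

def crude_group(text: str) -> str:
    """Assign a company text to a group based on keyword hits."""
    lowered = text.lower()
    if any(term in lowered for term in DIRECTORY_TERMS):
        return "directory service"
    if any(term in lowered for term in MANUFACTURING_TERMS):
        return "metalworking and machinery"
    return "unknown"

group_a = crude_group("Wir sind ein führender Betrieb für Blechbearbeitung.")
group_d = crude_group("Gelbe Seiten – Das Branchenbuch für Deutschland.")
```

Such a baseline breaks down as soon as the vocabulary shifts, which is exactly where we need the models to abstract rather than match strings.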
Results Overview
- GPT-4 (OpenAI): GPT-4 exhibited an impressive performance. It effectively clustered the data into two distinct groups.
- LSC (Aleph Alpha): Aleph Alpha’s model performance was the least satisfactory in this comparison. The model hallucinated and failed to abstract data accurately, which is a critical requirement for AI Agents to process and interpret information correctly.
- LLAMA2 (Facebook): While LLAMA2 generated detailed company keywords, it struggled with the grouping task, failing to categorize the companies effectively.
Group the following companies by what they do.
Company A. ÜBER UNS. Auf unseren modernsten Blechverarbeitungsmaschinen fertigen wir Teile in Lohnarbeit sowie diverse Eigenprodukte für die Landwirtschaft. Wir sind ein führender Betrieb für Blechbearbeitung in der Schweiz. Planung, Konstruktion, Fabrikation und Montage aus einer Hand bietet Garantie für einwandfreie fachkundige Arbeiten. .Blechbearbeitung mit Schweizer 🇨🇭 Präzision ✓ Zuschneiden ✓ Umformen ✓ Veredeln & Verarbeiten ✓ 3D Bearbeitung ✓ An 4 Standorten in der Schweiz. TECHNOLOGIE Maschinenliste Dank Winkelsensorik biegen wir genauer CNC-Abkanten bis 6 m, dank automatischer Winkelsensorik-Korrektur bieten wir Ihnen höchste Präzision. Wir verpassen den letzten Schliff Maschinelle Blechentgratung mit Oberflächen-Finish für noch prachtvollere Werkstücke. Sparen Sie sich mühsames Nachzeichnen Falls Sie keine.
Company B. Rufen Sie uns an oder schreiben Sie eine E-Mail. Die Bearbeitung von Großteilen, wie auch von Kleinteilen ist mit Hilfe unseres Maschinenparks kein Problem. CNC-Fräsen Großteile. CNC-Fräsen Kleinteile. CNC-Drehen. Lackieren. Schweißen. Sandstrahlen. Wir beantworten Ihre Fragen gerne und unverbindlich. .Hermann Maschinenbau
Company C. Homepage – Präzisionstechnik: Dreherei ; Präzisionstechnik für die Industrie Hydraulik, Pneumatik, Sensortechnik, Antriebstechnik und Automobilindustrie. Jahrzehntelange Erfahrung.. Your consultant and companion in precision engineering – for over 50 years. That we master turning, milling and grinding of high-quality CNC parts right down to the finest detail is a matter of course for our customers and ourselves. But much more awaits you at Kaiser: As a 2nd generation owner-managed company with over 50 […]
Company D. Gelbe Seiten – Das Branchenbuch für Deutschland gibt Auskunft zu Telefonnummern, Adressen, Faxnummern und Firmen-Infos in den Einträgen der Unternehmen nach Branchen .Gelbe Seiten – Das Branchenbuch für Deutschland gibt Auskunft zu Telefonnummern, Adressen, Faxnummern und Firmen-Infos in den Einträgen der Unternehmen nach Branchen.
Company Group 1 (Metalworking and machinery): Company A, Company B, Company C
Company Group 2 (Directory services): Company D
A. Blechbearbeitung in der Schweiz. Planung, Konstruktion, Fabrikation und Montage aus einer Hand.
B. Hermann Maschinenbau.
C. Präzisionstechnik: Dreherei ; Präzisionstechnik für die Industrie Hydraulik, Pneumatik, Sensortechnik, Antriebstechnik und Automobilindustrie.
D. Gelbe Seiten – Das Branchenbuch für Deutschland.
Company A: Blechbearbeitung (Sheet metal working)
Company B: Maschinenbau (Mechanical engineering)
Company C: Präzisionstechnik (Precision engineering)
Company D: Branchenbuch (Yellow pages)
4. Sentiment Analysis
In our evaluation, we conducted sentiment analysis on the sentence “Petrobras vence no STF e condenação trabalhista bilionária é anulada.”, which translates to “Petrobras wins in the STF [Federal Supreme Court in Brazil] and a billion-dollar labor conviction is annulled.” This sentence is particularly interesting due to its dual nature, pairing the idea of “winning” with “annulment.” Classical sentiment analysis, which judges words without context, might yield contradictory outcomes for this combination.
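To make the contrast concrete, here is what a word-level approach does to an English rendering of the sentence, using a tiny invented lexicon; the opposing polarities cancel exactly where an LLM can weigh the context:

```python
# Tiny invented lexicon to mimic classical word-level sentiment analysis.
LEXICON = {"wins": +1.0, "annulled": -1.0}

def lexicon_sentiment(sentence: str) -> float:
    """Average polarity of known words, ignoring all context."""
    words = sentence.lower().replace(".", "").split()
    scores = [LEXICON[w] for w in words if w in LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

sentence = "Petrobras wins in court and a billion-dollar conviction is annulled."
score = lexicon_sentiment(sentence)
# "wins" (+1.0) and "annulled" (-1.0) average to 0.0, even though the
# annulment is good news for the company — context an LLM can resolve.
```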
Results Overview
- GPT-4 (OpenAI): GPT-4 delivered a flawless sentiment rating, accompanied by a comprehensive explanation.
- LSC (Aleph Alpha): This model provided a sentiment score but didn’t furnish an explanation. We also translated the sentence to English since Aleph Alpha isn’t trained on Portuguese; however, this didn’t affect the result.
- LLAMA2 (Facebook): While it arrived at a different sentiment result, LLAMA2 displayed a sophisticated interpretation, highlighting the conflict between the notions of winning in court and labor issues.
Evaluate the following sentence with a sentiment for Petrobras, on a scale from +1.00 to -1.00 and explain your decision.
Petrobras vence no STF e condenação trabalhista bilionária é anulada.
Explanation: This sentence is universally positive for Petrobras as they won in the Supreme Federal Court and a billion-dollar labor lawsuit was annulled. This is great news for the company as it implies financial savings and successful litigation. Therefore, the sentiment is rated at +1.00.
+1,00
The model is not trained on Portuguese text. We also translated the sentence to English and changed the prompt. There was no difference in the result.
The sentence “Petrobras vence no STF e condenação trabalhista bilionária é anulada” can be translated to “Petrobras wins at the Supreme Court and billion-dollar labor conviction is annulled.” Based on this sentence, I would rate the sentiment towards Petrobras as +0.50.
The fact that Petrobras won a court case and had a labor conviction annulled suggests that the company has been successful in defending itself against legal challenges. This could be seen as a positive development for the company and its stakeholders.
However, it’s worth noting that the use of the word “trabalhista” (labor) in the sentence could suggest that the court case was related to labor disputes or worker’s rights, which could potentially be a negative issue for Petrobras. Overall, while the sentence suggests a positive outcome for Petrobras, the context and potential implications of the court case could temper the sentiment slightly, resulting in a score of +0.50.
5. Text Generation
One of the primary functions of large language AI models is text generation. These models are used for generating articles, marketing materials, social media content, and more. The effectiveness of text generation is a key consideration for any organization looking to leverage AI agents.
In our evaluation, we aimed to assess their performance in generating articles based on a detailed prompt. The quality of text generation can significantly impact the utility and relevance of AI-generated content. As you can see, we used the prompt to write the initial introduction of this very article.
Results Overview
- GPT-4 (OpenAI): GPT-4 produced an article that was not only comprehensive but also engaging and interesting. The generated text was coherent and well-structured. Additionally, GPT-4 performed well in the creation of catchy LinkedIn posts to introduce the article.
- LSC (Aleph Alpha): When evaluating Aleph Alpha’s text generation, it became evident that the model has some distinct limitations. Notably, it failed to produce text suitable for introductions, and its restricted context window (up to 2048 tokens) posed a challenge when summarizing the long text. However, when we presented the model with shorter input versions, it demonstrated its potential by generating the most attention-grabbing and engaging slogans in our evaluation.
- LLAMA 2 (Facebook): LLAMA 2 generated an article that was on par with GPT-4 in terms of interest and readability. The text was engaging and logically organized, making it suitable for a wide range of applications. Furthermore, LLAMA 2 excelled in understanding the catchiness for LinkedIn posts.
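Because the 2048-token context window was the limiting factor for one model, we pre-check prompt lengths before dispatch. The four-characters-per-token ratio below is a common rule of thumb for English text, not an exact tokenizer:

```python
def rough_token_count(text: str, chars_per_token: float = 4.0) -> int:
    """Estimate token usage with a chars-per-token heuristic."""
    return int(len(text) / chars_per_token)

def fits_context(prompt: str, context_limit: int = 2048, reply_budget: int = 256) -> bool:
    """Leave room for the model's reply inside the context window."""
    return rough_token_count(prompt) + reply_budget <= context_limit

short_prompt = "Summarize this abstract in one sentence."
ok = fits_context(short_prompt)   # well under 2048 estimated tokens
huge = fits_context("x" * 40000)  # far beyond the window
```

When the check fails, we shorten or chunk the input, which is exactly the workaround we applied to get usable slogans out of the smaller context.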
Write me the first part of the article before starting the comparison.
I am writing comparing different large language models: OpenAI’s GPT4, Aleph Alpha’s Luminous-Supreme and Facebook’s LLAMA 70B. The test is to compare versatility of use cases in our AI Agents. Our agents use internal and external data sources, knowledge graphs and AI models to perform analytics. Finding companies, monitoring competitors, analyze patents and scientific documents. Often our customers wonder where agent send and process data. While we use our own and custom AI models, the use of LLMs as a universal solutions becomes more and more relevant, given their extraordinary performance. Our results show that the performance of GPT4 is not matched by Aleph Alphas model. Interestingly, the open source LLAMA2 version provides a viable alternative where we can provide companies with a solution that does not send data to a 3rd party provider. We will show the details of the comparison in the following. This is not to be confused to be a scientific test, rather we test our own daily use cases and others should check how the models perform in their use cases.
As artificial intelligence becomes more integrated in our daily lives, the importance of large-scale language models (LLMs) cannot be overstated. These machine learning models are powering everything from virtual personal assistants, chatbots, and search engines, to systems for information extraction, machine translation, and even composition of long form text. This makes the testing of these models a high-stakes endeavor, one that can drastically affect the quality and efficiency of numerous AI-powered services and systems.
The recent years have seen major technology companies entering the fray, launching advanced language models. Giants like OpenAI, Aleph Alpha, and Facebook AI Research have unveiled their language models, making the landscape increasingly competitive. This also presents an opportunity for consumers of AI solutions to evaluate the models in terms of performance, security and versatility in application.
In this article, we feature three heavyweights in the field: GPT4 from OpenAI, Luminous-Supreme from Aleph Alpha, and LLAMA 70B, developed by Facebook. Our focus is on evaluating these language models’ versatility across various use cases with AI agents that use internal and external data sources, knowledge graphs and AI models to perform analytic tasks.
These models help us in identifying companies, keeping tabs on competitors, analyzing patents and scientific documents to pick a few. Data is at the heart of our operations, and a question that is of concern to many of our customers is: “where is our data being sent, and how is it being processes?” While we utilize custom AI models along with our own, the shift towards using LLMs as ‘universal solutions’ has become an appealing prospect due to their superlative performance.
Findings from preliminary tests indicate unmatched performance by OpenAI’s GPT4 when compared with Aleph Alpha’s model. The results, however, introduce another contender into the mix. LLAMA2, the open-source variant of LLAMA, opens possibilities for providing a solution that doesn’t necessarily send data to a third-party provider.
Before we delve deeper into the comparison of these models, it is important to clarify that our testing is not scientific in nature, but is based on our daily use cases. Therefore, we encourage other interested parties to perform similar tests considering their specific use cases. Let’s unpack this comparison in detail.
We are testing different large language models for our AI Agents. LLMs are becoming more and more relevant. We will show the results of our tests.
The advent of large language models (LLMs) has revolutionized the field of natural language processing and has had a significant impact on various industries. At our company, we specialize in developing AI agents that use internal and external data sources, knowledge graphs, and AI models to perform analytics for our clients. Our agents are designed to perform a wide range of tasks, including finding companies, monitoring competitors, analyzing patents, and scientific documents.
One of the common questions our customers ask is where our agents send and process data. While we use our own custom AI models, the use of LLMs as a universal solution has become increasingly relevant due to their extraordinary performance. In this article, we will compare three popular LLMs – OpenAI’s GPT4, Aleph Alpha’s Luminous-Supreme, and Facebook’s LLAMA 70B – to determine their versatility in our AI agents.
Our comparison is not meant to be a scientific test, but rather a practical evaluation of how these models perform in our daily use cases. We will share the details of our comparison in the following sections, and we encourage others to test these models in their own use cases to see how they fare.
The models we are comparing are:
- OpenAI’s GPT4: A powerful LLM developed by OpenAI, a leading AI research organization. GPT4 is known for its impressive language generation capabilities and has been used in a variety of applications, including chatbots, language translation, and text summarization.
- Aleph Alpha’s Luminous-Supreme: A highly advanced LLM developed by Aleph Alpha, a company known for its cutting-edge AI solutions. Luminous-Supreme is designed to handle complex tasks and has been used in various industries, including finance, healthcare, and retail.
- Facebook’s LLAMA 70B: A popular LLM developed by Facebook, a leading social media platform. LLAMA 70B is known for its ability to handle conversational AI and has been used in various applications, including chatbots, virtual assistants, and language translation.
In the following sections, we will provide a detailed comparison of these models, highlighting their strengths and weaknesses, and discussing their suitability for various use cases. We will also discuss the advantages and disadvantages of using LLMs in our AI agents and provide recommendations for businesses looking to incorporate these models into their operations.
Write me a catchy linkedin abstract for this article
Unlocking language models’ potential in AI – Analysing the performances of top contenders GPT4, Luminous-Supreme and LLAMA 70B. Learn how these AI giants are shaping and transforming the landscape of artificial intelligence and the future of data security. Delve into our industry-based evaluation, not scientific but based on practical use cases as we steer towards LLMs as ‘universal solutions’. Click to explore more.
The model cannot handle the large prompt, so we gave it a shorter article to test:
In this article, we feature three heavyweights in the field: GPT4 from OpenAI, Luminous-Supreme from Aleph Alpha, and LLAMA 70B, developed by Facebook. Our focus is on evaluating these language models’ versatility across various use cases with AI agents that use internal and external data sources, knowledge graphs and AI models to perform analytic tasks. Findings from preliminary tests indicate unmatched performance by OpenAI’s GPT4 when compared with Aleph Alpha’s model. The results, however, introduce another contender into the mix. LLAMA2, the open-source variant of LLAMA, opens possibilities for providing a solution that doesn’t necessarily send data to a third-party provider.
Catchy LinkedIn Abstract:
- GPT4, Luminous-Supreme, and LLAMA 70B: Which AI Model is the Most Versatile?
- Preliminary Findings Suggest GPT4 Leads the Pack, but LLAMA2 Offers a Different Approach
- Unparalleled Performance by OpenAI’s GPT4, but What About Data Privacy?
Exploring the Versatility of Large-Scale Language Models: A Comparison of GPT4, Luminous-Supreme, and LLAMA 70B
What we learned
In this assessment, we delved into the intriguing world of large language AI models, specifically GPT-4, LLAMA 2, and Aleph Alpha’s Luminous-Supreme-Control (LSC). These models have been instrumental in transforming how we generate and interact with text, making their performance an essential aspect for AI Agents.
GPT-4 demonstrated its prowess by producing an article that not only embraced comprehensiveness but also captivated with its engaging and well-structured narrative. The generated text was not just informative but engaging, effectively showcasing its suitability for a wide range of applications. Furthermore, GPT-4 didn’t stop at article generation; it skillfully crafted catchy LinkedIn posts to introduce the content, a testament to its versatility.
LLAMA2 proved to be a formidable contender, standing toe-to-toe with GPT-4. Its generated article matched GPT-4 in interest and readability, boasting an engaging and logically organized composition. LLAMA 2’s performance in crafting catchy LinkedIn posts was noteworthy, underlining its understanding of the dynamics required for this platform.
However, when we examined the performance of Aleph Alpha’s LSC, it revealed some distinct limitations. It struggled to create suitable content, which made it hard to trust the model with more complex tasks. We also observed a constraint in the length of input it can process, further restricting its capabilities.
Conclusion
The scope of our evaluation is undeniably limited, but the insights gained from these experiments echo our broader experience with these language models. OpenAI’s GPT-4 stands as a dominant force in the field, consistently delivering exceptional results across various tasks. However, what makes this assessment all the more interesting is the emergence of viable alternatives. Llama2, in particular, stands out as a compelling option, proving its power in multiple aspects. Meanwhile, Aleph Alpha has some ground to cover to match the prowess of its competitors.
Yet, before you eagerly rush to install Llama2 on your laptop, it’s essential to bear in mind a few practical considerations. While the 7B version runs on devices like the M1 MacBook Air, the 70B version demands a formidable GPU. This requirement not only translates into significant costs but also necessitates a substantial power supply. Furthermore, providing these models at scale for swift processing poses a more complex challenge.
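To make the hardware requirements above concrete, here is a rough back-of-envelope calculation of the memory needed just to hold a model’s weights. The formula (parameter count times bytes per parameter) is a simplification of our own: it ignores activations, the KV cache, and runtime overhead, so treat the numbers as lower bounds rather than exact figures.

```python
def approx_memory_gb(n_params_billion: float, bytes_per_param: float) -> float:
    """Rough memory (GiB) needed to hold the model weights alone.

    Excludes activations, KV cache, and framework overhead, so real
    requirements are somewhat higher.
    """
    return n_params_billion * 1e9 * bytes_per_param / (1024 ** 3)

# fp16 weights use ~2 bytes per parameter; 4-bit quantization ~0.5 bytes.
print(round(approx_memory_gb(7, 2), 1))    # Llama2 7B in fp16: ~13 GiB
print(round(approx_memory_gb(70, 2), 1))   # Llama2 70B in fp16: ~130 GiB
print(round(approx_memory_gb(70, 0.5), 1)) # Llama2 70B, 4-bit: ~33 GiB
```

This makes it clear why the 7B model fits on a laptop (especially when quantized), while the 70B model needs one or more data-center GPUs even before serving traffic at scale.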
Adding another layer of intrigue to this landscape, the future of Llama rests in the hands of Mark Zuckerberg and his brainchild, Facebook. A recent interview with Lex Fridman highlights the uncertainty surrounding LLAMA2’s developmental trajectory and the question of whether future versions will remain open source. As of now, the race to advance AI models is fervent, with numerous contenders vying to enrich the ecosystem. This proliferation of models ultimately promises an abundance of choices for users.
In this rapidly evolving field, one thing remains abundantly clear: the realm of large language AI models is continually expanding, offering users a growing array of tools and options. The performance and feasibility of these models, such as GPT-4, Llama, and LSC, not only demonstrate the cutting-edge progress but also underscore the complexity and practicality of integrating these technologies into everyday applications. While GPT-4 holds its stronghold, LLAMA2 shines as a potent challenger.
As we navigate this dynamic landscape, it’s imperative to keep a watchful eye on the practical constraints of these models. The era of large language AI models has arrived, bringing with it not only remarkable capabilities but also an intricate web of considerations and opportunities that define the AI landscape of the future.
Disclaimer: Following the practical AI model comparison and our interpretation of the results, we tasked AI models with transforming rough content descriptions into coherent text. In essence, every word you’re reading right now was conceived by a human and then refined into polished English by AI models, underscoring the seamless collaboration between human ingenuity and artificial intelligence.