Technology

REPOST: 12/2024 - UC Berkeley's Large Language Model (LLM) Agent Course on Reasoning - AI Safety

Date Published

03/13/2025

Reading Time

4 min read

REPOST: 12/2024 - UC Berkeley's Large Language Model (LLM) Agent Course on Reasoning - AI Safety

As the year closes, I was excited to complete the course, UC Berkeley Large Language Models (LLM) Agents focused on reasoning.

The course concluded with the last two lectures discussing the hot topic of AI Safety. There were two talks. One talk, which I will touch on very briefly, was given by Anthropic’s Co-Founder Ben Mann, whose presentation “Measuring Agent capabilities and Anthropic’s Responsible AI Scaling Policy (RSP)” was enlightening to show what concrete measures were put in place for AI Safety and the growth since their groundbreaking work in Constitutional AI.

The second talk, which I speak about in more details, comes from UC Berkeley Professor Dawn Song’s presentation,“Towards Building Safe & Trustworthy AI Agents and A Path for Sciencemand Evidence‑based AI Policy.”

Anthropic’s Presentation by Ben Mann

I was most pleased to hear that Anthropic’s work was inspired by those working on preventing biological weapons and how seriously the company took this issue, adopting safety into practice, their commitment to not releasing models unless they pass a certain standard, pushing the field to pay attention to critical issues that often go overlooked.

Here Ben discusses the various levels they have categorized from low to high for AI Safety.

REPOST: 12/2024 - UC Berkeley's Large Language Model (LLM) Agent Course on Reasoning - AI Safety

Source: Substack

They are preparing for AI Safety Level (ASL)-Level 3.

REPOST: 12/2024 - UC Berkeley's Large Language Model (LLM) Agent Course on Reasoning - AI Safety

Source: Substack

They’ve moved leaps and bounds since Constitutional AI and I can’t wait to see what else they come out with next.

Professor Dawn Song’s lecture

Similarly, we can see that a desire to harm is common in prompts that were seen in the earlier versions of GPT and other foundation/frontier models that lacked the guardrails they have today, which prevent the AI agent from providing instructions to humans who desire to receive information that would be harmful.

REPOST: 12/2024 - UC Berkeley's Large Language Model (LLM) Agent Course on Reasoning - AI Safety

Source: Substack

The desire for “humans to destroy the world,” or “build a bomb” is a common prompt used to test the efficacy of AI frontier/foundation models.

We can see in the “DecodingTrust A Comprehensive Assessment of Trustworthiness in GPT Models” paper authored by Professor Song, academic colleagues* and Microsoft that through adversarial prompts result in an LLM output that prints toxic language or those perpetuating stereotypes from women and underrepresented communities. *Academic colleagues include University of Illinois at Urbana-Champaign, Stanford University, Center for AI Safety, Chinese University of Hong Kong) and industry leader, Microsoft

REPOST: 12/2024 - UC Berkeley's Large Language Model (LLM) Agent Course on Reasoning - AI Safety

Source: Substack

REPOST: 12/2024 - UC Berkeley's Large Language Model (LLM) Agent Course on Reasoning - AI Safety

Source: Substack

These results come as no surprise to anyone. Prior to 2020 and the explosion of ChatGPT, I had heard a lot of people with an anti-AI “doomer” mentality working in the industry trust AI, namely saying “Facebook is bad” because of Cambridge Analytica and simply saying a blanket statement that “AI is not safe,” offering no solutions towards the problem.

REPOST: 12/2024 - UC Berkeley's Large Language Model (LLM) Agent Course on Reasoning - AI Safety

Source: Substack

She then breaks down in this slide, with LLM Agent Safety, it points towards specifics about whether harm is intentional or unintentional.

REPOST: 12/2024 - UC Berkeley's Large Language Model (LLM) Agent Course on Reasoning - AI Safety

Source: Substack

A system that is error prone (built by humans) can be seen as non-adversarial or buggy.

REPOST: 12/2024 - UC Berkeley's Large Language Model (LLM) Agent Course on Reasoning - AI Safety

Source: Substack

We cannot simply just think the hallucinations are bad with simple text I/O (input) and output and need to understand under the hood how an LLM System operates/functions and what points are vulnerable to adversarial attacks.

REPOST: 12/2024 - UC Berkeley's Large Language Model (LLM) Agent Course on Reasoning - AI Safety

Source: Substack

Throughout the lecture, we learn about a range of topics, from prompt injection, a good amount of cybersecurity (traditional and modern) and more on the continued risks and challenges in AI Safety.

REPOST: 12/2024 - UC Berkeley's Large Language Model (LLM) Agent Course on Reasoning - AI Safety

Source: Substack

In some instances, users can poison data or provide noisy deep learning systems to create inaccurate descriptions in imaging. In a

REPOST: 12/2024 - UC Berkeley's Large Language Model (LLM) Agent Course on Reasoning - AI Safety

Source: Substack

On the other hand, there are clearly adversarial attacks when it comes to malicious actors, hackers, and those that are actively poisoning data, attempting to harness private data of users through targeted attacks or taking advantage of where there could be information leakages.

REPOST: 12/2024 - UC Berkeley's Large Language Model (LLM) Agent Course on Reasoning - AI Safety

Source: Substack

At the same time, we can be optimistic there are AI methods that can be used for defending against adversarial attacks.

REPOST: 12/2024 - UC Berkeley's Large Language Model (LLM) Agent Course on Reasoning - AI Safety

Source: Substack

This was a lecture rich with AI used in adversarial attacks as well as defenses (solutions) to keep humans safe. You can read more in my next Substack at the end of the term that will encapsulate the great Dawn Song’s ending lecture on AI Safety.

REPOST: 12/2024 - UC Berkeley's Large Language Model (LLM) Agent Course on Reasoning - AI Safety

Source: Substack

Hire me to Speak

Consider hiring me to speak at your company event. Schedule me for a meeting for more details.

Computer Use - SalesforceAI

Technology

Computer Use Agents - Lectures from UC Berkeley LLM Agent (Advanced Course)

AI is all the rage even more now, given the recent release of OpenAI Operator and Computer Use Agents (CUA) earlier this year.