When introducing our agents, we frequently face questions that go beyond the core technology. Users ask us:
How can I trust an intelligent agent? How do I know that I can rely on artificial intelligence?
We need to consider questions such as: What can we do to help users build that trust? How do we ourselves trust the technology we make? And what does “trust” towards an artificially intelligent technology mean in the first place?
With our intelligent agents we enter new ground in computing, because many users are not used to dealing with probability or chance when using computers at work. In many cases, our computing systems are supposed to be exact machines that are correct 100% of the time. While users have experience with probability in weather predictions, advertisements, or speech recognition, most do not encounter it in their professional work software. Let us underline that all the information an agent gathers is what any person could find on the internet or an intranet. We have trained our models to distinguish, quite meticulously, a trustworthy source from a dubious one. Our agents, and artificial intelligence in general, nevertheless produce results with a certain chance of being right or wrong.
Precision and Recall
As we detail below, there is no one-size-fits-all answer to the trust question, because the level of trust people place in AI varies with perspective and context. Our developers, like other providers of machine learning technologies, build systems that perform tasks such as pattern recognition, classification, and automatic valuations. They use statistical indicators such as “precision” and “recall” as performance and reliability metrics for the system. Precision is the probability that a positive prediction is true, i.e. that the results our agent suggests are actually good results. Recall is the probability that a positive target is predicted positive, i.e. the share of all relevant results in the overall data that our agent identified. The higher these two metrics are, the more reliable and trustworthy the system is. This is how we ensure the quality of our agents during development.
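To make the two definitions concrete, here is a minimal sketch in Python. The function name and the supplier labels are hypothetical and chosen only for illustration; this is not the evaluation code used for our agents.

```python
def precision_recall(predicted, relevant):
    """Return (precision, recall) for predicted items against the truly relevant items."""
    predicted, relevant = set(predicted), set(relevant)
    true_positives = len(predicted & relevant)
    # Precision: share of the suggested results that are actually relevant.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of all relevant results that were actually suggested.
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: five suppliers are truly relevant; the agent
# suggests four companies, one of which ("X") is not relevant.
relevant = {"A", "B", "C", "D", "E"}
predicted = {"A", "B", "C", "X"}

precision, recall = precision_recall(predicted, relevant)
print(precision, recall)  # 0.75 0.6
```

In this made-up case, three of the four suggestions are correct (precision 0.75) and three of the five relevant suppliers were found (recall 0.6).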
Subjective Recall and Subjective Precision
Most of the time, the answer to the trust question looks different for our users. We come across the trust challenge when we provide the agents to our clients, for example, for the task of supplier search. Based on specific user requests, our Company Identification Agent performs fast and requirements-specific supplier searches using information from the public web, mimicking a human search. Out of thousands of company candidates, the agent filters a shortlist of companies that provide the requested feature set and are based in the specified geography. The question we face is: “How can I trust the agent to find me all the relevant companies?”
After some more in-depth conversations, we realized that trust is strongly shaped by the first encounters with the agent. If the agent shows the companies that the user already knew existed, trust is high, but if the agent missed these companies, trust is low. Users wonder: “If the agent did not even find the companies I know, how can the rest of the list be any good?” This “recall” measure is, of course, highly subjective. It is not the ratio of companies the agent identified to all relevant companies that exist; it is rather about the agent finding the companies that are on our users’ minds. This makes a lot of sense to them, because nobody can possibly know all the companies that exist, so the only reference data users have is subjective.
We see another phenomenon when it comes to precision. When the agent shares its results, users judge false results by the perceived level of (human) stupidity: “How stupid was the agent in showing the false result?”, meaning, “How easy is it for a human user to see that this is not a correct result?” Given that the agent considers and connects data very differently from a human, there are indeed false cases that are very easy for humans to detect. For example, a supplier agent returns a record that is clearly not a relevant supplier. Such judgments are difficult for the agents, often because they lack contextual knowledge about the world. Unfortunately, one very “stupid” result in a list of otherwise very good results greatly reduces the level of trust in the agent. Indeed, humans follow the logic: “If one result is that stupid, how can the agent be right about the other results?”
So in both cases, recall and precision become subjective measures that do not really allow us to judge the level of trust that can be put into an agent. As a consequence of these misleading first impressions, users might not trust an agent that works quite well. Conversely, users may put trust in agents that do not necessarily perform well but were “lucky” to match the results the user expected.
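The gap between the two kinds of recall can be illustrated with a small sketch. All company names, agent outputs, and numbers below are invented for illustration; the point is only that recall measured against the few companies a user happens to know can differ sharply from recall measured against all relevant companies.

```python
def recall(found, reference):
    """Share of the reference set that appears in the found set."""
    found, reference = set(found), set(reference)
    return len(found & reference) / len(reference) if reference else 0.0


# Hypothetical ground truth: 20 relevant companies (unknown to the user).
all_relevant = {f"company_{i}" for i in range(1, 21)}

# The user only knows three of them.
user_known = {"company_1", "company_19", "company_20"}

# Agent A finds 18 of the 20 relevant companies but misses two the user knows.
agent_a = {f"company_{i}" for i in range(1, 19)}
# Agent B finds only four companies, which happen to include all three the user knows.
agent_b = {"company_1", "company_2", "company_19", "company_20"}

for name, found in [("A", agent_a), ("B", agent_b)]:
    objective = recall(found, all_relevant)
    subjective = recall(found, user_known)
    print(f"Agent {name}: objective recall={objective:.2f}, subjective recall={subjective:.2f}")
# Agent A: objective recall=0.90, subjective recall=0.33
# Agent B: objective recall=0.20, subjective recall=1.00
```

In this invented scenario, the weaker agent B looks perfect to the user, while the stronger agent A looks unreliable, which is the “lucky” effect described above.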
At a recent conference, we were talking about this phenomenon and a participant suggested a solution: we all, including our users, should learn about statistics, test methods, etc. and base our trust on the neutral data. While this sounds good, it can be hard to implement, since we cannot all become statisticians and change our emotions. Even if we did, in many cases the test data itself would need to be thoroughly inspected to understand what the results mean for the particular use case that users have in mind.
Therefore, we suggest a more pragmatic approach:
- In our work we share and explain the process used by the agent, so that users can better understand not only the results and their quality, but also the ways in which they can configure the agents more effectively.
- We further perform a deep analysis of individual false results and use these findings to explain the overall behavior of the agent. (Unfortunately, especially for deep learning models, this approach has limits, as the models and datasets become so complex that it is hard to explain an individual result. For example, in the case of our patent agent, we cannot explain in detail the logic behind every predicted classification. Also, our core intellectual property is sometimes a secret recipe that we cannot easily reveal.)
- That is why we further suggest taking a very pragmatic approach to the results provided by the agents and checking whether the agent indeed helps with, contributes to, or fully solves the overall task. In many situations we do not need to fully trust an agent, as it only supports a task and we are not bound by its decision. False results that are easy for humans to detect are usually not very problematic. Much more serious are the results that are false, yet hard to detect with human intelligence.
- We further explain the challenges with the agent and agree with the users not to use single encounters to make judgements about the agents. We encourage users to assign the agents different tasks to explore their capabilities. Just like with a human colleague, we need experience to understand in which areas they excel and in which they do not.
- Finally, we explain that artificial intelligence works differently from the experience and judgment of human intelligence. Therefore, we suggest not measuring the intelligence of the agents on a human scale. A seemingly dumb mistake made by an agent does not mean it cannot perform well in other cases, and vice versa.
What’s next in building trust towards the agents?
As you can see, many of the approaches above require us, as human agent designers and engineers, to support the users in building trust. This works, but it is a process that is hard to scale. Therefore, we are working on enabling the agents themselves to support users in gaining experience and in learning how to deal professionally with trust in intelligent technology.
These R&D efforts into XAI have been partially funded by the German Ministry of Education and Research (BMBF) and the European Social Fund (ESF) as part of the “Future of Work” program (funding code: 02L18A140).