Notes from Amanda Levendowski's Paper
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3024938
HOW COPYRIGHT LAW CAN FIX ARTIFICIAL INTELLIGENCE’S IMPLICIT BIAS PROBLEM
Abstract
AI systems can reflect or exacerbate societal bias, from racist facial recognition to sexist NLP.
These biases threaten to overshadow AI's technological gains and potential benefits.
Many sources of bias:
- homogeneous creators
- flawed algorithms
- incomplete datasets
Role of law itself: copyright
AI systems learn to 'think' by reading, viewing, and listening to copies of human works.
Copyright law's exclusion of access to certain copyrighted source materials may create or promote biased AI systems.
Copyright law limits bias mitigation techniques, such as testing AI through reverse engineering, algorithmic accountability processes, and competing to convert customers.
The rules of copyright law also privilege access to certain works over others, encouraging AI creators to use easily available, legally low-risk sources of data for teaching AI, even when those data are demonstrably biased.
Very similar issue with public medical imaging datasets, since we have very limited sources, typically from only a few sites in the US, which is what a lot of the world is using (+ their local hospital data if researchers or companies)
A different part of copyright law -- the fair use doctrine -- has traditionally been used to address similar concerns in other tech fields, and may be capable of addressing AI bias.
In large part, this is because the normative values embedded within traditional fair use align with the goals of mitigating AI bias and, quite literally, may help create fairer AI systems.
word2vec word embeddings
Can recognize that Beijing is to China in the same way as Warsaw is to Poland, as capital and country, but not in the same way as Paris relates to Germany.
word2vec is sexist
man is to computer programmer / woman is to homemaker
If the underlying dataset reflects gendered bias, those biases would be reinforced and amplified by sexist search results.
Due to their widespread usage as basic features, word embeddings not only reflect such stereotypes but also can amplify them.
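The analogy behavior described above can be sketched with vector arithmetic over word embeddings. The toy vectors below are hand-crafted purely for illustration (real word2vec vectors are hundreds of dimensions and learned from the news corpus); the point is that if the data bakes a gender component into "programmer," the standard analogy query inherits it.

```python
import numpy as np

# Hand-crafted toy embeddings (hypothetical, illustrative only).
# Dimensions: [capital-ness, China, Poland, gender (male +), tech job, home]
vecs = {
    "china":      np.array([0., 1., 0., 0., 0., 0.]),
    "beijing":    np.array([1., 1., 0., 0., 0., 0.]),
    "poland":     np.array([0., 0., 1., 0., 0., 0.]),
    "warsaw":     np.array([1., 0., 1., 0., 0., 0.]),
    "man":        np.array([0., 0., 0., 1., 0., 0.]),
    "woman":      np.array([0., 0., 0., -1., 0., 0.]),
    # Bias baked into the data: "programmer" carries a male gender component.
    "programmer": np.array([0., 0., 0., 1., 1., 0.]),
    "homemaker":  np.array([0., 0., 0., -1., 0., 1.]),
}

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def analogy(a, b, c):
    """Solve 'a is to b as c is to ?' via b - a + c, nearest by cosine."""
    query = vecs[b] - vecs[a] + vecs[c]
    candidates = {w: v for w, v in vecs.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(query, candidates[w]))

print(analogy("china", "beijing", "poland"))   # warsaw -- capital relation
print(analogy("man", "programmer", "woman"))   # homemaker -- inherited bias
```

The same arithmetic that correctly recovers the capital/country relation also reproduces the sexist analogy, because the analogy mechanism has no way to distinguish a genuine semantic relation from a statistical regularity of the corpus.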
Google licensed articles from global news agencies to create word2vec, and open-sourced the toolkit, but the Google News corpus was not released. It's unlikely that any researcher could get a similar license deal, even in a bid to create a less biased corpus.
Without access to the underlying corpus, downstream researchers cannot examine whether a news outlet or journalist exhibits gender bias across multiple articles.
Nor could they supplement the corpus with data derived from additional, less biased works.
Garbage in, garbage out
Copyright law causes friction that limits access to training data and restricts who can use certain data.
This friction is a significant contributor to biased AI.
The friction caused by copyright law encourages AI creators to use biased, low-friction data (BLFD) for training AI systems, like word2vec, despite those demonstrable biases.
This also prevents bias mitigation techniques, like reweighting algorithmic inputs or supplementing datasets with additional data.
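Reweighting, one of the mitigation techniques named above, can be sketched in a few lines. The group labels below are a hypothetical toy example (not from the paper): each training example is up- or down-weighted by the inverse frequency of its group so that under-represented groups contribute equally to the training objective.

```python
from collections import Counter

# Toy dataset: each example tagged with a demographic group (illustrative).
groups = ["A", "A", "A", "A", "A", "A", "B", "B"]  # group B is under-represented

counts = Counter(groups)
n, k = len(groups), len(counts)

# Inverse-frequency weights: each group's total weight becomes equal (n / k).
weights = [n / (k * counts[g]) for g in groups]

print(sum(w for w, g in zip(weights, groups) if g == "A"))  # 4.0
print(sum(w for w, g in zip(weights, groups) if g == "B"))  # 4.0
```

Of course, this only works if you can inspect the training data to compute the weights in the first place, which is exactly what restricted access to the underlying corpus prevents.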
Copyright law can even preclude potential competitors from converting the customers of dominant AI players.
Part I - Teaching Systems to be Artificially Intelligent
Toy datasets like MNIST and Cats / Dogs may not be that relevant, but when serious systems are deployed and have unintended biases, they cause real problems.
- Man mistakenly flagged as a different fraudulent driver
- Taiwanese student couldn't renew his passport, because an AI system thought his eyes were closed
- Commercial AI systems used by law enforcement are consistently less accurate for women, African Americans, and younger people.
The implicit biases resulting in Type I (false positive) and Type II (false negative) errors become important and even dangerous.
Part 2 - Copyright Law Causes Friction For Creating Fairer AI Systems
The internet may be full of cats, but it does not follow that the photographs and videos featuring those cats are free for anyone to use.
It remains an open question whether copies created for purposes of training AI systems constitute “copies” under the Copyright Act, which defines “copies” as “material objects . . . in which a work is fixed by any method now known or later developed, and from which the work can be perceived, reproduced, or otherwise communicated, either directly or with the aid of a machine or device.”
Thus, certain “copies” may be so fleeting that they are not considered copies at all. Google, for example, has developed a technique called federated learning, which localizes training data to the originating mobile device rather than copying data to a centralized server. It remains far from settled that decentralized training data stored in random access memory (RAM) would not be considered “copies” under the Copyright Act.
Thus, the rules of copyright law can be understood as causing two kinds of friction: competition and access. From a competition perspective, copyright law can limit implementation of bias mitigation techniques on existing AI systems and constrain competition to create less biased systems. And from an access perspective, copyright law can privilege the use of certain works over others, inadvertently encouraging AI creators to use easily available, legally low-risk works as training data, even when those data are demonstrably biased.
Part 3 - Invoking Fair Use To Create Fairer AI Systems
🌅