Indexing and AI

Library employees working at computers, UTA Central Library

There’s rampant speculation about the impact of AI on many aspects of publishing. Here, indexer Tanya Izzard looks at the possible implications of AI for indexing.

Definitions and how it works

Artificial Intelligence (AI) can be defined as technology that can mimic cognitive functions associated with human intelligence. Generative AI is a type of AI that produces new content, such as text or images, in response to a prompt. The generative AI tools discussed here rely on a set of technologies that include:

Machine Learning (ML) technologies, where machines learn and improve through algorithmic analysis of data
Large Language Models (LLMs), statistical models trained by ML on very large bodies of text, which generative AI tools use to perform functions
transformers, the neural network design on which most current LLMs are built. An LLM stacks many transformer layers, and text passes through all of them each time the tool produces a response.

The AI tools we’ve seen most of are chatbots like ChatGPT. When you ask a chatbot something, it responds algorithmically to predict and create a suitable answer. This has some limitations:

chatbots need precise and targeted prompts to give specific answers. Prompt writing is a complex skill and inadequate prompts result in limited responses.
chatbots are prone to hallucinations, where they invent information. This happens because the chatbot is predicting the most likely good response to the prompt, not looking up verified facts.
again, because the chatbot is trying to give the most likely answer, using statistical probability rather than judgement, responses can be inaccurate.
bias and ethics: a chatbot will reflect the biases in the data its LLM was trained on. Ethical data acquisition for an LLM is a contentious issue.
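
To make "predicting the most likely answer" more concrete, here is a minimal sketch in Python. It is a toy that only counts which word most often follows another in a sample sentence, and is far simpler than a real LLM. The sample text and the function names are invented for illustration.

```python
from collections import Counter, defaultdict

def train_bigrams(text):
    """Count which word follows which in the sample text."""
    words = text.lower().split()
    following = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        following[current][nxt] += 1
    return following

def most_likely_next(following, word):
    """Return the most frequent follower of a word, or None if it was never seen."""
    candidates = following.get(word.lower())
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]

# Hypothetical sample text, standing in for the model's training data.
SAMPLE = "the index lists the topics and the index lists the names"
model = train_bigrams(SAMPLE)
print(most_likely_next(model, "the"))
```

The toy picks a word because it is statistically common after "the" in its sample, not because it has judged that word to be correct. That is the limitation described in the list above, in miniature.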

AI and publishing

It’s fair to say that, for the publishing sector, AI is both opportunity and threat. Various legal cases are in progress to help copyright owners protect their intellectual property from being used without permission to train LLMs. But some publishers are looking to generate profit by licensing their content for LLM training.

There are already AI tools in use in the sector, helping with repetitive tasks like managing metadata. AI functions have enhanced existing tools for spelling and grammar checks. AI tools exist that can check for copyright infringement or plagiarism, and others that can bring eBooks and audiobooks to market very quickly. All these tools offer potential efficiencies and improved profits.

AI-generated content, however, is more problematic. It may have problems with accuracy or originality, and the sheer speed of generated content means that some publishers have been flooded with AI-created submissions.

And protecting copyright means that freelancers may not be able to use generative AI tools at all. Freelancer contracts will usually prohibit sharing manuscripts or proofs, and some now specifically prohibit the use of AI tools.

Human skills of indexing

In this debate, it’s useful to remember the specifically human skills of indexing that cannot be easily replicated by a generative AI tool working algorithmically, not rationally. These include:

empathy: the ability to put ourselves in the place of readers and consider what they need from an index, what they would search for, what their priorities would be.
judgement: we make rational decisions on what to leave out, what to put in, and index concepts, not just words. We include implicit concepts in the index that could not be found by searching. We also create usable index structures with multiple points of entry.
understanding: through subject expertise, we understand not only the content of a book but also the context of the text.

We bring our own knowledge and life experience to each index we compile, as well as our indexing skills. We create human indexes for other humans to use.

Indexing: what can current AI tools do?

At the time of writing, there are no AI indexing tools. Chatbots will suggest index headings for sections of text if you ask them, but the results are patchy, with poorly formed headings and not all concepts covered. Asking them to include page numbers often results in inventions, even if you point them at a particular edition of a work in the public domain. This is because the tools are making algorithmic, not rational, judgements. Some chatbots recognise this, and will advise you to consult a professional indexer.

In any case, for the contractual reasons noted above, it’s usually not possible to use tools like this in our work. And why would you outsource the fun parts of your job to a machine, anyway?

Automated indexing tools already exist that will query a whole text and produce a draft index. An example of their output is included in Dennis Duncan’s book Index, a History of the (Allen Lane 2021), and it shows the limitations of these options: so much checking and editing is required that it would be quicker to index from scratch yourself.
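
As a rough illustration of why a word-based draft needs so much editing, here is a deliberately naive sketch in Python. It is not how any particular product works, nor the tool shown in Duncan’s book: it simply lists every word that is not on a short stopword list against the pages where it appears. The page texts, the stopword list and the function name are all invented for this example.

```python
import re
from collections import defaultdict

# Hypothetical stopword list; a real tool would need a far longer one.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "was", "for", "her", "she"}

def draft_word_index(pages):
    """Build an alphabetical word-to-pages listing from page texts."""
    index = defaultdict(set)
    for page_number, text in pages.items():
        # Treat each run of letters as a word; everything is lowercased first.
        for word in re.findall(r"[a-z]+", text.lower()):
            if word not in STOPWORDS:
                index[word].add(page_number)
    return {word: sorted(nums) for word, nums in sorted(index.items())}

# Hypothetical two-page extract.
PAGES = {
    10: "The queen mourned for her husband.",
    11: "Grief shaped the later reign of the queen.",
}
draft = draft_word_index(PAGES)
for heading, locators in draft.items():
    print(heading, locators)
```

The draft contains headings nobody would look up, such as "later" and "shaped", and it has no single heading that gathers both pages under a concept like mourning. Every one of those decisions is left to the human doing the checking.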

Indexing: AI possibilities

AI tools may emerge that are helpful with part of the indexing process. For example, you could create a tool that collated the placenames in a text, and use that as a checklist to ensure you’d included all the significant mentions. Error checking is another area where new tools might be helpful.
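
A placename checklist of this kind does not even need AI. The sketch below, in Python, assumes you already have a list of placenames to look for (called a gazetteer here) and the text of each page; both are invented for illustration, as are the function names. It collates the pages on which each placename appears and reports any that have no heading in a draft index.

```python
import re
from collections import defaultdict

def collate_placenames(pages, gazetteer):
    """Map each known placename to the sorted pages on which it appears."""
    found = defaultdict(set)
    for page_number, text in pages.items():
        for place in gazetteer:
            # Whole-word, case-sensitive match, so "Bath" does not
            # match "bath" or "Bathurst".
            if re.search(r"\b" + re.escape(place) + r"\b", text):
                found[place].add(page_number)
    return {place: sorted(nums) for place, nums in found.items()}

def missing_from_index(checklist, index_headings):
    """List collated placenames that have no heading in the draft index."""
    return sorted(place for place in checklist if place not in index_headings)

# Hypothetical page texts and gazetteer.
PAGES = {
    1: "She left Bath for London in the spring.",
    2: "London was crowded; she took a bath and slept.",
    3: "The road to St Ives ran through Bathurst.",
}
GAZETTEER = ["Bath", "London", "St Ives"]

checklist = collate_placenames(PAGES, GAZETTEER)
print(checklist)
print(missing_from_index(checklist, {"London"}))
```

The output is only a checklist. Deciding which mentions are significant enough to index, and under what heading, remains the indexer’s job.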

Even though current AI tools are limited in the headings they create, given the speed at which machine learning is developing, it’s quite possible that an AI tool could be developed that would create a reasonably accurate index for a text. It might be algorithmic and limited, but efficiency savings might mean that it was good enough for our current clients. Various scenarios for the indexing profession emerge from this possibility, none of them encouraging:

human indexing simply becomes redundant, as publishers embrace the cheap but limited AI index option.
human indexers are still needed to check and refine an AI-produced index. In this scenario, there is still work for indexers, but whether it would command high enough fees for the work involved is debatable.
human-produced indexes become a niche artisanal product for a particular type of client or book.

As with many other knowledge professions, the development of generative AI could represent an existential threat to our line of work.

Indexing: AI obstacles

Before we descend into existential despair, we should note the barriers to further AI development. The march of AI is not inevitable and there are obstacles to progress:

structural obstacles, such as shortages in programming skills and the availability of raw materials and components
economic obstacles, like the profitability of AI tools, the possibility of an overhyped investment market, and the needs of venture capital to generate return on its investment. Is an indexing tool profitable enough to merit development?
regulatory obstacles, as governments consider controls on AI development. Legislation to address climate change may also inhibit AI expansion.
reputational obstacles: AI-generated content can be biased, inaccurate or unoriginal, qualities that no author or publisher wants for their books.

In thinking about risk, we evaluate both the impact of an event and the likelihood that it will occur. The potential impact of AI on indexing is severe; the likelihood much less certain.

AI and indexers

So what can we, as indexers, do in the face of all this?

Firstly, we can keep abreast of developments in AI tools and how the publishing sector responds to them. The IPG and the Publishers Association publish regularly on AI topics and run training events. CIEP has a useful series of blog posts on the implications of AI. The NUJ and Society of Authors look at AI from the point of view of writers.

We can continue to emphasise the benefits of human indexing. We are humans doing a job, not machines performing a task: our work is underpinned by training, CPD, experience, establishing relationships with clients and other indexers, and reflecting on our progress. Machine learning may be fast, but it is not necessarily reflective.

Most importantly, though, we can make the case for human-produced indexes based on empathy, sound understanding and rational judgement, indexes that take account of the needs of other humans.

Acknowledgements

My thanks to Dr Catherine Pope, trainer, coach and technical communicator, for help with the technical content of this post.