That's not a huge problem when you're working with a limited corpus. The LLM can produce an answer, which you then tokenize and vectorize, and compare against the same information in the small dataset it's supposed to be talking about. If the LLM wanders off into la-la land, have it try again.
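A minimal sketch of that check-and-retry loop, using plain bag-of-words cosine similarity as a stand-in for real embeddings (the `ask_llm` callable, the corpus, and the 0.3 threshold are all made-up placeholders, not anyone's actual API):

```python
import math
import re
from collections import Counter

def vectorize(text):
    # crude stand-in for an embedding: bag-of-words term counts
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(a, b):
    # cosine similarity between two sparse count vectors
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def grounded_answer(ask_llm, corpus, threshold=0.3, max_tries=3):
    # keep asking until the answer overlaps enough with the corpus
    corpus_vec = vectorize(" ".join(corpus))
    answer = ""
    for _ in range(max_tries):
        answer = ask_llm()
        if cosine(vectorize(answer), corpus_vec) >= threshold:
            return answer
    return answer  # best effort after max_tries

# toy stand-in for an LLM: first reply is la-la land, second is on-topic
replies = iter(["unicorns dance on the moon",
                "the warranty covers parts and labor for two years"])
corpus = ["warranty covers parts", "labor is covered for two years"]
print(grounded_answer(lambda: next(replies), corpus))
```

The first reply shares no vocabulary with the corpus, so it scores 0 and gets rejected; the second overlaps heavily and passes. With real embeddings the comparison is semantic rather than lexical, but the loop is the same shape.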