Alexander Doria

@dorialexander.bsky.social

502 Followers · 236 Following · 106 Posts

LLM for the commons. Cofounder Pleias


Experimenting with synthetic literature using a freshly finetuned llama 8b: Plato's Republic as a film noir works surprisingly well.

Announcing that we are on our way to solving a long-standing issue in document processing: the correction of OCR mistakes. Pleias publishes the largest dataset to date with automated OCR correction, 1 billion words in English, French, German and Italian: huggingface.co/datasets/Ple...

My PhD thesis on financial speculation and mass media in 19th-century France was really not supposed to be a how-to. But as it turns out, things haven’t changed much…

Reposted by

Alexander Doria

This meme is too good for Twitter, so I'm just dropping it here.

Currently working on an OCR correction model and it accidentally creates hallucinated fiction when the source is *really* noisy.

Small announcement: opening Paid Research Internship opportunities at a startup to train open science LLMs. Profiles could be AI-focused (with some familiarity/experience in open source LLM communities) or data-focused (with a DH background). Based in Paris, full remote possible.

Announcing the release of marginalia, a python library to perform corpus analysis and retrieve structured annotations with open LLMs like Mistral Open-Hermes-2.5. github.com/Pleias/margi...

Reposted by

Alexander Doria

Am feeling really happy and relieved that, 3 days into the new year, the Big AI News is... copyright-respecting creation of cartoon mice! Courtesy of @dorialexander.bsky.social. huggingface.co/Pclanglais/M...

Pclanglais/Mickey-1928 · Hugging Face (huggingface.co)

Happy new year! And happy public domain day with a major new entry: the original design of Mickey Mouse! For the occasion I’m releasing Mickey-1928, a model on Hugging Face that can generate pictures of Mickey, Minnie and Pete from 1928. huggingface.co/Pclanglais/M...

I’m afraid an LLM has been taken hostage by DH people. The situation is very dire: we have reached the point where putting the entire pretraining corpus in TEI is explicitly mentioned.

Getting a bit frightened at times by the awareness of my own creation. MonadGPT is trained on 17th century sources but not on sources about the 17th century.

I don't want to spoil what I've done, so here's its answer to the most pressing question of our time. Not sure what to make of the fact that it demurs on this topic but was happy enough to spout facts about softball.

I have just released the early modern ChatGPT, MonadGPT, currently running thanks to generous support from @huggingface.bsky.social. Any question in English or French will be answered from the perspective of someone living between 1500 and 1750. huggingface.co/spaces/Pclan...

Reposted by

Alexander Doria

So I watched the big OpenAI announcement yesterday, and the 128k context is undeniably cool. But idk about GPTs — I think I'm still more excited about this approach:

After Brahe, here comes another form of #DH LLM: I am officially releasing MonadGPT, a ChatGPT of the 17th century. MonadGPT is a finetune of the excellent chat model Mistral-Hermes on 10,000 excerpts of early modern English, French and Latin books. huggingface.co/Pclanglais/M...

Reposted by

Alexander Doria

Our search for the best OCR tool in 2023, and what we found source.opennews.org/articles/our...

Our search for the best OCR tool in 2023, and what we found (source.opennews.org). A side-by-side comparison of five OCR tools using multiple kinds of documents, from DocumentCloud.

Further announcement for #DH Bluesky: there is now a light version of Brahe, the LLM for multilingual literary annotations huggingface.co/Pclanglais/B... And it runs in the free version of Google Colab, with an official demo here: colab.research.google.com/drive/1VTi6Z...

Since the #DH community has grown a lot lately: announcing the official release of Brahe, a fully multilingual analytic LLM for analyzing literary works. Brahe works like BookNLP and outputs as many as 20 annotations per literary excerpt. huggingface.co/Pclanglais/B...

Out of sheer contrarianism, I am tempted to tweet only tomorrow and then stop for the rest of the year.

Reposted by

Alexander Doria

Prioritization, evaluation, variation... a great interview with Ivan Yamshchikov, conducted by @dorialexander.bsky.social, covering a range of current topics, notably the ongoing debate over model evaluation. When will we see randomized trials of models? www.lebonllm.fr/ivan-p-yamsh...

Well, my latest giveaway of codes on Bluesky was too successful (which is quite telling in itself…). If you happen to have any to spare, I would gladly take them.

Hi #DH folks here. Do you know of any large corpus of TV series scripts?

Hypnotic science fiction, with a circular plot and a thoroughly unreliable narrator who at times splits in two. I don't know how I managed to miss it until now, but it is surely my read of the month.

The right season for Italian beaches (not too hot, no more private beaches, nobody around).

Reposted by

Alexander Doria

Reply to

Ted Underwood 🦋

I love you all dearly, but the idea that AI is going to drink all our water is to this platform kind of what 5G and vaccines are to Truth Social.

A few months ago I was hoping Twitter's replacement would simply be Mastodon, but no, it is going to be LinkedIn/Mastodon/Bluesky/Discord (and soon Threads?). At this rate, we might as well revive the message boards of the 2000s…

Post your Bluesky name and bio in AI and see what it makes of it. A try with dalle-3 and my own TintinIA model, and this is basically ideal stock photo vs. real life.


Reposted by

Alexander Doria

American Stories, a new collection of 438 million public-domain newspaper stories from the lab of Melissa Dell. 1/2

dell-research-harvard/AmericanStories · Datasets at Hugging Face (huggingface.co)

Hello World (I like that, so far, this is the most Twitter-like app among all the Twitters)
