Avatar
I'm not going to litigate the specifics of this situation, but there are some critical lessons here for people who are thinking of running a labeler (and to some extent they're the lessons of T&S in general, but they are even more important given the paradigm of composable moderation).
Avatar
This thread covers the two fundamental things all labelers need to decide on up front and stick to: 1) Who is doing the moderation, what are their biases, and how are those biases mitigated? 2) Are you moderating/labeling objective actions/content, or subjective characteristics?
Avatar
Each of these two points has a lot (and I mean A LOT) of nuance. (Like everything having to do with T&S!) Let's start with #1: bias mitigation. People who oppose community-driven moderation are now smugly parading around going "of course anyone who wants to be a mod is biased!"
Avatar
This is the wrong way to look at it. It's not an inherent problem with community moderation: it's an inherent problem with people. Everyone is biased, in a million different ways. We all have our viewpoints of what we think is good vs bad.
Avatar
Elon Musk thinks the word "cis" is a slur and should be moderated: that's a bias. I think people who create accounts only to advertise things are spammers and should be moderated: bias. You may think associating a wallet name with an account name is doxing and should be moderated: bias. Etc.
Avatar
T&S, inherently, is a biased process: it involves someone's definitions of what should and shouldn't be actioned. There is no such thing as neutral, unbiased moderation. Anyone who says otherwise is simply asserting societal prejudices that are declared "objective" because of who holds them.
Avatar
And, crucially, people don't want moderation to be "unbiased", or to fall back solely on externalities such as "is this content legal". Don't believe me? Look at the months-long Discourse on child safety: most of the content many people very loudly want removed is legal under US law.
Avatar
What people are calling "bias" here, me included (because it's shorter), is actually better termed "viewpoint". Moderation is a function of viewpoint. You choose a viewpoint lens through which to moderate and apply it to your policies and actions.
Avatar
The neat thing about Bluesky's experiment in composable moderation (which, as everyone who's been following me for ages knows, I am still dubious about the long term likelihood of success of, but this is *not* the reason why) is that you can pick which viewpoint you want to view the site through.
Avatar
What people starting up labelers are going to have to do, though, is work out how to ensure the agents doing the work to action reports are going to apply *the labeling service's* viewpoint and not their own. This is an incredibly, incredibly difficult problem.
Avatar
The fundamental tension here: a labeler with a strong viewpoint built from the (actual or perceived) consensus of a specific group as to what should be moderated will naturally want to draw its agents from members of that group, who have a familiarity with the group's social norms and practices.
Avatar
This allows contextual interpretation of reported content. Failures of cultural competency result in problems where the members of the group can easily understand why a post should be moderated, but an outsider has no idea and thinks the post is innocuous. This happens *all the time*.
Avatar
However, members of the group, who will have social connections within the group and have already formed opinions and reads on people in the group, will, always, need to compensate for the human tendency to read charitably when you agree with/like the speaker and uncharitably when you don't.
Avatar
Let me be very clear here: this is not an individual failing of any specific person. It's fundamental human nature. You can compensate for it when you know the tendency exists, but you can never eliminate it. I do it. You do it. Every moderator ever has done it.
Avatar
There are various process methods a team manager can use to compensate for it, over and above the methods individuals can use. People commonly propose a double-agree system, where two people have to sign off on an action. That can help, but is deeply impractical at any kind of volume.
Avatar
You can do escalating levels of agreement needed for more severe actions; this might look like "single agree for labeling a post, double agree for labeling an entire account". But there are problems with that, too! First: 99.9% of whole account actions will be completely uncontroversial.
Avatar
So you're *still* increasing your workload for no reason. Second, since 99.9% of actions are uncontroversial, the person doing the check is going to be strongly inclined to agree with the first person on the .1% too, because they're used to most of their decisions being right.
Avatar
Our brains are, fundamentally, bad at spotting .1% events. Again: this is fundamental human nature! There's only so far you can process your way out of it. Third, if everyone on the team comes from the same group, they're more likely to have the same predispositions to read charitably/uncharitably.
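(To make the mechanics of that escalating-agreement idea from a couple posts up concrete, here's a minimal sketch in Python. The action names, thresholds, and the DID in the example are all made up by me for illustration, not anything a real labeler uses.)

```python
from dataclasses import dataclass, field

# Purely illustrative severity tiers: how many distinct agents must sign off
# before an action goes through. Not any real labeler's thresholds.
REQUIRED_AGREEMENTS = {
    "label_post": 1,      # single agree for a single post
    "label_account": 2,   # double agree for account-wide actions
}

@dataclass
class ProposedAction:
    action: str                        # e.g. "label_post" or "label_account"
    subject: str                       # the post or account being actioned
    approvals: set = field(default_factory=set)

    def approve(self, agent_id: str) -> bool:
        """Record an approval; return True once enough distinct agents agree."""
        self.approvals.add(agent_id)
        return len(self.approvals) >= REQUIRED_AGREEMENTS[self.action]

# An account-level action needs two different agents to sign off.
action = ProposedAction("label_account", "did:example:123")
print(action.approve("agent_a"))  # False - only one approval so far
print(action.approve("agent_b"))  # True  - second agent agrees, action can apply
```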
Avatar
You can set a policy that agents should not act on reports where they already had a pre-existing opinion on one party to the conflict. That gets you closer to fixing the problem, but it, too, has issues: for one, it's entirely self-reported and relies on agents being honest about recusal.
Avatar
And most people *are* honest about this! But if you have a lot of undone work and very few people doing it, the agent who has the energy to slam through a ton of reports will want to make things easy on their coworkers and reason with themselves "well, it wasn't a *strong* opinion..."
Avatar
Or "but this is a *really obvious* violation" or "but we're getting so many reports about this and if we don't do something they're just going to keep piling up" or "people are demanding to know why we haven't Done Something..."
Avatar
No matter how much you emphasize that "don't act on a case where you have a pre-formed opinion" is a hard and fast rule, there will always be a point at which someone thinks they have a good reason to bend it. Often it is a very good reason! And it usually turns out fine! Until it doesn't.
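(The recusal rule itself is trivial to encode; the honesty is the hard part. A hypothetical sketch, where the conflicts data is entirely self-declared and therefore exactly as reliable as the agent filling it in. All the names here are invented.)

```python
# Hypothetical report-routing helper: skip agents who have self-declared a
# conflict with either party to the report. "conflicts" is entirely
# self-reported, so this is only as good as the agents filling it in.
def eligible_agents(report_parties, agents, conflicts):
    """agents: list of agent ids; conflicts: dict of agent id -> set of
    user ids that agent has declared a pre-existing opinion about."""
    return [
        agent for agent in agents
        if not (conflicts.get(agent, set()) & set(report_parties))
    ]

agents = ["agent_a", "agent_b", "agent_c"]
conflicts = {"agent_a": {"user_x"}, "agent_c": {"user_x", "user_y"}}
print(eligible_agents({"user_x", "user_z"}, agents, conflicts))  # ['agent_b']
```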
Avatar
Two: a ton of reports come in about people who have had past reports. Sometimes lots of past reports. Sometimes involving a long running conflict between two people that requires deep understanding of the underlying conflict.
Avatar
(We occasionally see one of those long running conflicts flare up again on DW from, I shit you not, 2002. My old LJAT folks will know exactly who I mean if I say that it involves a person with the initials KD.)
Avatar
Handling those conflicts requires expertise about that backstory, usually from someone who handled an earlier case. Under a pure "no cases where you already have an opinion" rule, anyone on a team could only handle one report from a long-running feud and then they'd be conflicted out forevermore.
Avatar
This is a bad and counterproductive standard! And just knowing "one of my comods took/didn't take action against this person in the past" is technically enough to make most agents form an opinion anyway, so even a one and done policy isn't enough to fully mitigate this problem!
Avatar
Which leads me to three: sometimes, and especially when your team is mostly recruited from the same group, there is no person on the team who hasn't already formed an opinion about one party involved. Often they don't even hold the same opinion, or an opinion about the same person!
Avatar
But there are a lot of reports about people within a community where everyone from that community is conflicted out because that person is just *that* well-known. It happens with very famous people (ie the Trump Twitter suspend) and well-connected people (ie highly socially networked people).
Avatar
So what do you do? One option is to require unanimity from the entire team even though they're all conflicted out: that covers "we all have an opinion but they're all different opinions". For "we all hate this person", less so.
Avatar
That is gonna require an outside opinion from an uninvolved person. But! Either the uninvolved person is not culturally competent, in which case they may miss things, or they are, and they have opinions of their own.
Avatar
And if it's "not culturally competent/ignorant of the backstory", okay, so you write up an explainer to give them with the post you want them to gut check! Except, oops, that explainer is written with your opinion coloring it unless you're impeccably careful, and it is HARD to be impeccably careful.
Avatar
There's also the problem of who do you get to do the gut check? Do they know your policies and the standards you use? Do they have experience evaluating things? Also, does your privacy policy let you share the information you have (or, if it only lets you share part, how does that bias the decider?)
Avatar
These issues are fundamental to any system of moderation. Not because, as is often cynically asserted, "no one who wants to be a moderator should be allowed to do it", but because, again, of fundamental human nature. There are ways to mitigate them a little, but only so far.
Avatar
When I was running the LJ team, I know for goddamn sure I screened out a ton of applicants who would probably have been perfectly capable of setting aside their personal opinions about how an issue should be handled and applying our policy instead, but I was paranoid about it.
Avatar
Our entire application process was designed to tease out "is this person applying because they want to influence how we handle a specific type of content" in addition to identifying the people who would be deeply fucked up by this work. It was that rigorous for a reason!
Avatar
And that reason wasn't "because no one who wants to be a moderator should be allowed to be", but because literally everyone *does* have strong feelings about how some things should be handled and you have to screen hard for "who can set those strong feelings aside".
Avatar
I could keep going on the issue of viewpoint, but hopefully I've shown some of the ways that -- through no one's direct and immediate fault -- it needs to be something you aggressively manage and compensate for, with multiple methods, using multiple tools. And you're still going to fuck it up.
Avatar
So let's move on to issue two: the nature of the moderation you're supplying. I phrased it above as: Are you moderating/labeling objective actions/content, or subjective characteristics? Which is an incredibly high-level formulation of the problem, so let's break it down a bit.
Avatar
Every moderation action is subjective to some extent (see previous 8000 posts in this thread, sigh.) But there's a scale. On the one end, you have objective labelers like @profile-labels.bossett.social with objective criteria: "new account" means the first post was within the last 30 days, etc.
Avatar
There will never be a disagreement over "does this account have a profile pic uploaded or not". There can be cases where the label is *wrong*, because it hasn't updated yet or the objective criteria aren't a 1:1 match: it labels people who regularly delete their posts as "new account", for instance.
Avatar
But you can look at the definitions it uses and go "oh, this just hasn't updated yet" or "oh, I see the technical limitation this specific account ran into". It's way out there on the "objective" end.
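(A toy version of that kind of fully objective rule, using the 30-day "new account" example from above. The function and field names are mine, and I'm not claiming this is how that labeler is actually implemented.)

```python
from datetime import datetime, timedelta, timezone

# Toy "objective" rule: the criteria are mechanical, so two people applying
# them to the same data always get the same answer. The label can still be
# *wrong* (stale data, accounts that delete their old posts), but it is
# never ambiguous what the rule means.
def is_new_account(first_post_at: datetime, now: datetime | None = None) -> bool:
    now = now or datetime.now(timezone.utc)
    return (now - first_post_at) <= timedelta(days=30)

recent = datetime.now(timezone.utc) - timedelta(days=5)
old = datetime.now(timezone.utc) - timedelta(days=400)
print(is_new_account(recent))  # True
print(is_new_account(old))     # False
```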
Avatar
On the other end, you have labels that are entirely subjective. If I start a labeler that labels accounts "asshole", there's an implicitness to that: I am labeling accounts of people *I think* are assholes, using my own (very idiosyncratic, I assure you) criteria.
Avatar
Labeling posts is essentially self-documenting: if the label shows only on a single post, an outside observer can look at the post and make their own determination of whether they agree or not. Labeling accounts is where it gets hard.
Avatar
And a lot of potential labels fall into the great amorphous middle on the subjectivity scale. You can point to objective criteria for "spam" or "engagement hacking" or "posts without alt text" or what have you. Defining "intolerance" or "trolling" or "extremism" is harder.
Avatar
Again, this is a fundamental problem with content moderation that's not unique to the composable moderation approach. Most sites, you get "the main service's definitions and that's it". Here, you can pick whose definitions you agree most with.
Avatar
If you've followed me for a while, you know what I'm gonna say here: writing policies for when to take action on an account that are as concrete as possible is fiendishly, incredibly difficult. I was giving someone a hand with policy a few weeks ago. It was one specific "here's how we handle X".
Avatar
My explanation of "here's how we handle X", where X = a topic of reasonable complexity but still not the most difficult thing we define, ran around 4000 words. Our definition of "spam" is mostly "I know it when I see it" based, but if I wrote it down, it would probably run about the same.
Avatar
And, crucially, *you cannot make those detailed flowcharts and definitions public*. Ever. If you do, you have just handed a weapon to the people who find it funny to walk right up to the line, stick a finger over it, and go "I'm not touching you, neener neener".
Avatar
The way Bluesky composable moderation works, each labeler gets to create their own labels and define them however they want. Example: the official Bluesky moderation service defines "threats" as "Promotes violence or harm towards others, including threats, incitement, or advocacy of harm."
Avatar
This one-sentence definition, I fucking guarantee you, has a similar 4000-word policy flowchart document defining every single word in that sentence and highlighting how to determine whether content meets those definitions that Bluesky will never (and should never) make public.
Avatar
Take the word "promotes"! We use that one a lot in our internal policy, too. (Along with the closely related but not identical "advocates".) Our definition of promotion is "any content or behavior that is intended to influence the audience to commit an action".
Avatar
Our definition of advocacy is "any content or behavior that is intended to persuade the audience that a particular action is desirable, useful, or beneficial". Are those the definitions Bluesky uses? I have no idea!
Avatar
Are those definitions subjective as fuck? Absolutely! Are they still as concrete and specific as I could get them? They sure are. Have we had multi-day debates about whether specific content meets them? Hell fucking yeah we have.
Avatar
This is what content moderation is. This is *almost entirely* what content moderation is. "We do not remove content that X, unless it also meets condition Y (which is defined as Z) or condition Q (which is defined as S.)"
Avatar
"If 51% of the recently posted content (defined as D) in the account meets the standard of E, we remove the account. Otherwise we remove the specific content. The exceptions are F, G, and H, where even if under 50% of recently posted content is E, we will remove the account."
Avatar
"However, if under 50% of the recently posted content is E, condition G applies, but exception J, we will not remove the account and only remove the specific content." Etc, etc, etc, ad infinitum, on a fucking million questions and a fucking million policies.
Avatar
Our concrete policy on assisted account recovery (when you've lost your password and lost access to all your old email addresses) is so complicated that I finally gave up and started putting explicit brackets around clauses so it was easier to tell which bits went with which clauses.
Avatar
If you are starting a labeler, this is the kind of thing you need to do for every single thing you label. Write the summary first. Then define all the terms you used in the summary. Then start digging into exceptions and exceptions to the exceptions.
Avatar
What does this policy take down? And, more importantly: what does this policy leave up? What's the most awful thing you can think of that someone could say that would NOT cross this line? What's the *least* awful thing that would? What are the objective differences between those two?
Avatar
If you can't find an objective difference between the most awful thing that would not be taken down and the least awful thing that would, look again. Look hard. Look *really* hard. You're working on vibes: what's influencing the vibes? What is the difference your gut is trying to tell you?
Avatar
Now put the policy with those explicit examples and differences and definitions away for a week, take it back out on a day when your mood is drastically different than it was on the day you wrote the first draft, and *run the examples again*. Did your answers change?
Avatar
If you're already getting reports, save your 10% of reports that are most borderline. Put them away for a week. Without looking up the original outcome of the report, apply the policy again. Then look up the original outcome. Did you get the same answer? Did you take the same action?
Avatar
And if you didn't: why? What was different between then and now other than your mood that day? Now hand that same list of reports to someone else on the team and have *them* apply the policy. Did they get the same answer? Did they take the same action?
Avatar
If not, ask them why: what part of the policy were they applying to arrive at their answer, and what did they read that policy to mean? Is it that the definition in the policy isn't clear, or they have different interpretations, or *they're* in a bad mood that day?
Avatar
If, after you go through all those steps a few times and revise the policy to make it clearer, you are still disagreeing with your past self or your other team members are still disagreeing with you more than about 10% of the time: your label is a vibes-based label, not an objective label.
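(One way to keep that re-test from being vibes-based itself is to actually measure the disagreement rate on the saved borderline reports. A rough sketch, assuming you recorded the original outcome and the re-run outcome for each report; the 10% cutoff is just the rule of thumb above.)

```python
# Rough consistency audit: compare the original decisions on your saved
# borderline reports against the re-run decisions (by your later self or by
# a teammate), and see how often they diverge.
def disagreement_rate(original_decisions, rerun_decisions):
    """Both arguments: dict of report id -> decision string."""
    shared = set(original_decisions) & set(rerun_decisions)
    if not shared:
        return 0.0
    differing = sum(
        1 for rid in shared if original_decisions[rid] != rerun_decisions[rid]
    )
    return differing / len(shared)

original = {"r1": "label", "r2": "no_action", "r3": "label", "r4": "label"}
rerun    = {"r1": "label", "r2": "label",     "r3": "label", "r4": "no_action"}

rate = disagreement_rate(original, rerun)
print(f"{rate:.0%} disagreement")  # 50% in this toy example
print("vibes-based label" if rate > 0.10 else "holding up as concrete")
```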
Avatar
This is not inherently a bad thing! The advantage of composable moderation is that there is room for vibes based labels and labelers. Maybe some people want to subscribe to a labeler that is only "people rahaeli thinks are assholes"! But don't mix vibes-based labels and concrete labels.
Avatar
If I had a labeler for "people rah thinks are assholes", it would be a bad idea to also have it label "accounts rah thinks are spam". I can concretely define "spam". (In a lot of words, but I can concretely define it.) My definition of "asshole" is "did they annoy me". That's vibes.
Avatar
Every existing labeler except for the profile labeler I linked to upthread has a mix of vibes-based labels and non-vibes based labels. Take the official Bluesky mod service, @moderation.bsky.app. "Engagement farming" and "spam": can be concretized. "Rude": entirely vibes.
Avatar
Mixing the two is bad! If you want to label both, and you're not Bluesky itself (since there are concerns about keeping it all in one place to make it default), make two labelers, one for your vibes labels and one for your concrete labels.
Avatar
And finally, the last lesson from this incident, which I did not call out specifically at the top of the thread because it's kinda related to both of the fundamental issues: moderation based on off site actions, screenshots, extrinsic evidence, personal accounts and testimony, etc, etc.
Avatar
Unlike most of the things I'm discussing here, this one does have a very simple answer: Don't. Unless it's one of a few very narrow exceptions, do not ever act on offsite behavior or evidence. It's too easy to fake the evidence AND you have no way of conclusively proving it's the same person.
Avatar
The two exceptions: on-site solicitation to join off-site CSAM trading or distribution networks, and "incitement to harass": linking to someone's accounts offsite *and* encouraging, explicitly or implicitly, one's readers to follow that link and engage with the person.
Avatar
There are *some* times when it can be useful to act on off-site activity, such as proactively labeling the account of someone who is widely known for advocating and promoting intolerance on other social media IF they confirm on that other social media that their account on your service is theirs.
Avatar
There are also times when it can be useful to use offsite information to help give you background information or help you understand the conflict. But for "offsite information" there I mean "information still in place in the original location and the poster confirms the accounts are both theirs".
Avatar
But you have to be *incredibly* careful with that. Except as above, never use it as the deciding factor in a judgement. Never use it as your sole evidence. Moderation on a site has to be about behavior on the site itself. Yes, this is exploitable, yes, terrible people use it to skate by.
Avatar
But using offsite information is a fucking hand grenade, because you don't have access to the tools on that other site. You don't have access to the metadata or the context. And, in the case of screenshots, it is *trivial* to fake them.
Avatar
I have gotten SO MANY fake screenshots proffered as "evidence", faked with everything from Photoshop to "inspect element, change some words in the source, screenshot the results". It takes seconds to do. Sure, the majority of screenshot receipts are real, but that can be a *very slim* majority.
Avatar
You will never detect some of the manipulation. You will never be able to authenticate the screenshot. You do not have subpoena power to put someone under oath to authenticate them under penalty of perjury. Just make it easier on yourself and don't.
Avatar
Yes, this results in some terrible people being able to jump to a new service and start their whole terribleness over again with a new audience. That's a thing to be handled through community warning posts, not labeling. The two systems are not the same.
Avatar
Trying to use a labeling system to handle all forms of community self-defense is going to have massive failure modes. Sometimes people are shit offline and behave perfectly fine online for a really long time. As much as it sucks, and it does fucking suck, that's not what moderation is for.
Avatar
Moderation on a site needs to be about someone's behavior on that site. Anything more than that and you run into a billion wicked problems that end in major explosions.
Avatar
Unless I think of anything else, here endeth the thread.
Avatar
rah you are extremely good at breaking all this down and i appreciate it greatly. fantastic thread. i do not miss moderation work lol
Avatar
Yeah really. Whenever something like this comes down my first reaction is "thank goodness I don't have to deal with moderating this." And great thread as always @rahaeli.bsky.social, a lot of learning is possible here (and we'll see if that happens).
Avatar
"To summarize the summary of the summary: people are a problem." - Adams
Avatar
Thanks for your thoughts and time, very insightful
Avatar
Have you thought of writing a book on moderation? (Perhaps you did, IDK)
Avatar
Many, many, many times, have several outlines half-drafted, keep arguing with myself about the focus
Avatar
etc. It could be the Knuth of social media :)
Avatar
"keep arguing with myself about the focus" fitting to worry about scope creep for a book about content moderation... *ba-dum tssss*
Avatar
Maybe just collect (and expand upon as you see fit) the long threads and call it something like "Nuggets From the Trust & Safety Mines." >_>
Avatar
I think you could toss some darts at a topic list, and write something worth reading. Best of luck to this.
Avatar
Just accept that part of the title is gonna be "Volume 1" 8-)
Avatar
This was thorough (oh fuck, was it thorough), well thought out, and presented without the vitriol that has defined most of the recent Discourse. Very well done, and I hope you got some rest after dropping that thread.
Avatar
I went back to sleep! I have a lot of notifications now!
Avatar
No good deed goes unpunished 🤣
Avatar
As usual, thank you for the insight. I was waiting for this thread as soon as I saw the Aegis announcement.
Avatar
Lol so was I. And it was worth the wait!
Avatar
Thanks for writing all this out! I’ve been wondering what you would have to say about some of the recent events. Also, I’m glad good T&S people like you exist. The internet would be so much less usable without y’all.
Avatar
I learned so much from this thread! I agree with the commenter above who asked about a book. Anything from a memoir to a beginner's guide to moderation, or something really technical, I'd want a copy!
Avatar
I learned a lot from this thread, thank you for taking the time to write it!
Avatar
This was really interesting and well explained, thanks!
Avatar
I am very glad you and other people like you think all of this through so thoroughly. While I greatly appreciate the results and am interested in learning about the topic, I am ill-suited to doing this kind of work myself. I have so much appreciation for you writing all of this out! ❤️
Avatar
Thanks for this. Hard earned lessons, I'm sure, and it overlaps with the little I know.
Avatar
Thank you. This thread was obviously a lot of work, and I appreciate you doing it.
Avatar
god damn that's several things I never considered in community moderation work. thank you for this!
Avatar
Avatar
I got to the part of the Aegis announcement where they said future labelling teams need to be plugged in to whisper networks and flinched. No no that is the opposite of how to do moderation.
Avatar
Right? I do not want to be told that beancheeseforever is actually Jane Doefawn the notorious silent farter, because, 1, sez who, and 2, why should that influence how I react to her on this site? (And 3, from *their* perspective, what if the [stinky] label makes me *more* inclined to follow her?)
Avatar
I wonder if, in communities that are prone to being assaulted and victimized (notably the queer and trans communities), there is some value in this kind of labeler? If yes, it would necessarily need to be a separate labeler focused exclusively on this kind of scenario.
Avatar
I can absolutely see the value in it, but problem number 1, "sez who", is still a *big* problem.
Avatar
I've even been the direct target of that. There's a specific adverse medical experience I had that is rare, but I'm far from the only one who's had it. In whisper networks, though, it became "anyone who mentions that experience is me," with lists of past activity across various sites offered as proof, when much of it wasn't me.
Avatar
Also an excellent way to find yourself running a gofundme for a legal defense fund for yourself _and your sources_. With all due sympathy to the good intentions here: no, do not do this. Whisper networks are sub-rosa for _good reason_.
Avatar
It's dicey enough calling out people whose bad behavior is amply documented, up to and including an arrest record: just ask any user of the old livejournal davis square community.
Avatar
(This is really one of the primary reasons I’m a little pessimistic about bluesky’s composable moderation experiment: the more that a labeling team looks like a professional org with procedures and bylaws and transparency, the more it’s going to look like an organization with legal liabilities.)
Avatar
Ah, jonmon. At least he gave us some useful court precedent?
Avatar
Oh good, it wasn't just me. That just seems like overt anti-transparency.