Top 20 worldwide with social-engineering and a cheat that's still undetected
tl;dr
using math and reinforcement learning, we built a cheat that works on literally every VSRG - osu!mania, etterna, quaver, stepmania, you name it. no game-specific code. just math and a trained model that learned to read scrolling notes and output perfectly timed keystrokes.
it worked too well. top 20 worldwide on osu!, then banned. funnier still, they had no proof the scores were cheated, and could only ban us on multi-accounting suspicions.
the best part? we social-engineered our way into the community so hard that top 100 players were vouching for us being legit. people analyzed our replays and said “yeah this looks human.” whilst we never even pressed a single key.
introduction
this is the story of how we spent a weekend building something we probably shouldn’t have, got way better results than we expected, and confirmed that the hardest part of cheating isn’t fooling the software, it’s fooling the people.
it started as a joke. “what if we made a bot that could play osu!mania?” turned into “what if it worked on every rhythm game?” which turned into “what if we actually tried to rank?” which turned into high ranked accounts in every rhythm game, a top 20 global ranking, friendships with players who had no idea they were befriending a python script, and eventually a ban that proved we’d done our job almost too well.
we’re not releasing the code. we’re not here to help you cheat. we’re here because these were really interesting technical and sociological problems we’ve worked on, and the intersection of math, machine learning, reverse engineering, and straight-up social engineering is too good not to document.
also, statute of limitations on internet drama is like 6 months… right?
so, what even is a VSRG?
VSRG stands for Vertical Scrolling Rhythm Game. if you’ve ever played guitar hero, dance dance revolution, or beatmania, then you’ve played a VSRG. notes fall from the top of the screen, you press the corresponding key when they hit the bottom. simple.
the competitive scene is… not simple.
games like osu!mania, etterna, quaver, and stepmania have global leaderboards, ranked play, and communities that take timing accuracy and legitimacy very seriously. we’re talking millisecond-level precision. the difference between a “marvelous” and a “perfect” hit can be 16ms. the difference between top 100 and top 1000 is consistency across thousands of notes.
here’s the thing that made this project possible: every VSRG is fundamentally the same problem.
notes scroll down. you press keys. the visual style changes between games (arrows, circles, bars, whatever) but the underlying mechanic is identical. timing windows vary, scoring formulas differ, but the core loop is:
- note appears
- note travels toward hit zone
- player presses key at the right moment
- game judges timing accuracy
which means if you can solve this problem once, mathematically, you can solve it for every game in the genre.
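to make the "solve it once" point concrete, here's a rough sketch of the game-agnostic model we mean (the `Note` / `judge` names and the windows are purely illustrative, not our actual code):

```python
from dataclasses import dataclass

@dataclass
class Note:
    column: int      # which lane/key the note belongs to
    hit_time: float  # when it should be pressed, in ms from song start

def judge(hit_time: float, press_time: float, windows: dict[str, float]) -> str:
    """Map a timing error to a judgement, e.g. {"marvelous": 16, "perfect": 34, ...}."""
    error = abs(press_time - hit_time)
    for name, window in sorted(windows.items(), key=lambda kv: kv[1]):
        if error <= window:
            return name
    return "miss"

# every VSRG reduces to: for each Note, produce a press_time that lands in the
# window you want. the skin, the sprite, and the scroll direction are irrelevant.
```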
this seems like an easy problem to solve... right? well, not exactly.
the naive approach (and why it doesn’t work)
“just OCR the screen and press keys lol”
technically, this fails for a few reasons:
- latency kills you (screen capture → processing → keypress = 30-50ms minimum)
- timing windows in competitive VSRGs are 16-20ms for perfect scores
- note patterns overlap, colors lie, skins vary wildly
we tried this first. it was bad. like, worse-than-an-average-player bad.
but here’s the thing most people don’t realize: even if your bot plays perfectly, you still lose.
the human anti-cheat
osu! has actual anti-cheat software, sure (albeit old; you can still see some of vmfunc and pushfq's research on their anti-cheat here). but the real threat isn't the anti-cheat, it's the community.
ever heard of r/osureport? it’s a subreddit dedicated entirely to hunting cheaters. thousands of players spending their free time analyzing replays, tracking improvement curves, comparing timing distributions, and writing detailed reports on anyone who looks suspicious.
they have spreadsheets. they have bots that automate report tracking. they categorize cheats by type: relax hacks, timewarp, macros, aim assist, multi-account, and whatnot. they track ban statuses. it’s genuinely impressive how organized they are at ruining cheaters’ days.
and it’s not just reddit. set a good or a top score on a ranked map and you’ll have people in your DMs within hours. “hey nice score, can you stream it?” “your UR is crazy consistent, what’s your setup?” “I noticed your improvement curve is insane, how long have you been playing?”
every replay is public. every score can be downloaded and frame-by-frame analyzed. people will literally overlay your cursor movements on top of known cheaters to look for patterns. they’ll run statistical analysis on your hit timing distributions. they’ll compare your offline vs online performance.
the osu! community has mass-reported accounts into bans purely based on “this doesn’t feel right.” no concrete proof, just collective suspicion. and staff often listens.
so the real challenge isn’t just “play the notes correctly.” it’s:
- play the notes correctly
- look human while doing it
- have a believable improvement curve
- survive community scrutiny
- don’t get ratio’d or doxxed while doing so
this is why most cheats get caught. not because the software detected them, but because a bored 16-year-old with too much free time noticed something was off. or because you beat the score of another top player who didn't take it well.
so how do we even achieve this?
you don’t need to see, you need to predict
the notes are deterministic. the chart is the chart. if you know the scroll speed and the current timestamp, you can mathematically derive where every note will be at any moment.
the problem becomes:
- figure out the scroll speed (calibration)
- figure out the current song position (audio sync)
- predict note arrival times
- compensate for system latency
- press keys at the right moment
this is just calculus. and a little signal processing.
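roughly, the prediction step looks like this (a simplified sketch, not our implementation; the chart times, the audio-sync position, and `system_latency_ms` are assumed inputs):

```python
def schedule_presses(chart_hit_times_ms, song_pos_ms, system_latency_ms, deviate=lambda: 0.0):
    """Given the chart's note times and the current song position (from audio sync),
    return how long to wait before sending each keypress, compensating for the
    fixed pipeline latency (capture -> processing -> input injection)."""
    waits = []
    for hit_time in chart_hit_times_ms:
        time_until_hit = hit_time - song_pos_ms
        if time_until_hit < 0:
            continue  # note already passed the hit zone
        # press early by exactly the system latency, plus whatever
        # humanization offset the deviance layer wants to add
        waits.append(time_until_hit - system_latency_ms + deviate())
    return waits
```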
but wait… remember the human problem? pure math gives you perfect inputs. perfect inputs get you banned in 24 hours.
learning from the masters
here’s where it gets interesting. osu! lets you download replays. any replay. from anyone. including the top 50 players in the world.
so we did exactly that. we scraped thousands of replays from top 100 players across different skill brackets. not just their best plays, their mediocre ones too. the ones where they choked. the ones where they were warming up. the ones at 3am when they were using different setups and keyboards.
from each replay, we extracted:
- hit timing distributions (how early/late they hit notes)
- timing variance patterns (do they get more inconsistent on dense sections?)
- release timing on long notes
- error clustering (do mistakes come in bursts or randomly?)
- per-finger timing differences (index vs middle vs ring finger patterns)
turns out humans are incredibly consistent in their inconsistency. top players don't just have better accuracy; they have signature error patterns. their timing distributions aren't random noise, they're shaped by muscle memory, fatigue, and playstyle.
so in theory, if you can mimic that, while also mimicking a natural pattern of improving after playing a chart over and over, or choking every now and then, you're basically just a human to the external eye.
we now had a dataset of what "legitimate top player" actually looks like, statistically. the goal wasn't to play perfectly; it was to play exactly like a human would if that human had god-tier reflexes.
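to give an idea of what those per-replay signatures look like, here's a rough sketch of the statistics, assuming you've already parsed a replay into expected/actual hit times (the parsing itself is format-specific and omitted here):

```python
import numpy as np

def replay_signature(expected_ms, actual_ms):
    """Summarize a player's timing behavior from one replay."""
    errors = np.asarray(actual_ms) - np.asarray(expected_ms)  # positive = late, negative = early
    return {
        "mean_offset": float(errors.mean()),   # early/late bias
        "std": float(errors.std()),            # overall consistency
        "skew": float(((errors - errors.mean()) ** 3).mean() / errors.std() ** 3),
        # do mistakes cluster? autocorrelation of consecutive errors is near zero
        # for independent noise, and clearly positive for bursty human mistakes
        "clustering": float(np.corrcoef(errors[:-1], errors[1:])[0, 1]),
    }
```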
the math (for nerds)
alright, let’s get into the actual implementation.
detection
we sample single pixels at the hit zone for each column. for a 4-key layout, that’s 4 regions:
```go
var regions = [4]image.Rectangle{
	image.Rect(733, 922, 734, 923),
	image.Rect(884, 921, 885, 922),
	image.Rect(1027, 929, 1028, 930),
	image.Rect(1182, 922, 1183, 923),
}
```
each frame, we capture these 1x1 pixel regions and check if a note is present via color matching:
```go
c := img.At(0, 0)
r32, g32, b32, _ := c.RGBA()
// RGBA() returns 16-bit channels; shift down to 8-bit before comparing
r, g, b := int(r32>>8), int(g32>>8), int(b32>>8)

// tap note detection (blue-ish: RGB ~148, 157, 253)
// abs is a small integer-abs helper defined elsewhere
dr, dg, db := abs(r-148), abs(g-157), abs(b-253)
tapDetected := dr <= tolerance && dg <= tolerance && db <= tolerance

// long note detection (white-ish: RGB ~230, 227, 228)
dr, dg, db = abs(r-230), abs(g-227), abs(b-228)
holdDetected := dr <= tolerance && dg <= tolerance && db <= tolerance
```
simple color thresholding with a tolerance of ~3. works surprisingly well across different skins as long as you calibrate the target colors.
the humanization layer
here’s where it gets interesting. we track KPS (keys per second) in real-time and use it to dynamically adjust timing variance:
```go
var (
	DevianceMax float64 = 50.0 // max random delay (ms)
	DevianceMin float64 = 10.0 // min random delay (ms)
)

// Deviate sleeps for a random duration in [Min, Max) milliseconds
// (uses math/rand and time)
func Deviate(Min, Max float64) {
	Delay := rand.Float64()*(Max-Min) + Min
	time.Sleep(time.Duration(Delay) * time.Millisecond)
}
```
humans get less consistent when playing faster. so we scale deviance with KPS:
```go
AddedDeviance := CurrentKPS / 2.5
NewDevianceMin = DevianceMin + AddedDeviance
NewDevianceMax = DevianceMax + AddedDeviance

// final delay calculation; clamp KPS to at least 1 so the division
// can't blow up on the first notes of a song
Deviate(NewDevianceMin*2, NewDevianceMax/float64(Clamp(int(CurrentKPS), 1, 30))/1.5)
```
at low KPS (slow sections), timing is tight and consistent. at high KPS (dense streams), variance increases. this matches how real players behave, obviously you can’t maintain perfect consistency at 20+ KPS.
long notes get different treatment though, smaller deviance on press/release because humans are actually more consistent with holds:
```go
// long note press
Deviate(NewDevianceMin/3.5, NewDevianceMax/6.7)
// long note release
Deviate(NewDevianceMin/3.5, NewDevianceMax/6.5)
```
but random noise isn’t enough
pure math gets you accuracy. but leaderboards also care about:
- release timing (for long notes)
- micro-adjustments for “humanization”
- not looking like a bot (variance in timing distributions)
a uniform random distribution doesn't look human. humans have biased error distributions: they tend to hit slightly early or slightly late depending on playstyle, and their errors cluster in patterns.
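a toy comparison to show why uniform noise is a tell (assuming scipy; the distributions here are made up for illustration, not our fitted ones):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# a bot adding uniform noise in [-8, +8] ms
uniform_errors = rng.uniform(-8, 8, 5000)

# a human-ish error distribution: biased slightly early, with heavier tails
human_errors = stats.skewnorm.rvs(a=-3, loc=2, scale=7, size=5000, random_state=0)

# the shapes give it away even when both land inside the timing window:
# uniform noise has no central peak (excess kurtosis around -1.2), while
# human-like errors peak near their bias and have fatter tails
print(stats.kurtosis(uniform_errors), stats.kurtosis(human_errors))
```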
this is where the RL came in.
training the humanizer
we had thousands of replays. we had a working bot. now we needed to make them talk to each other.
the setup
we framed this as a policy optimization problem. the agent's job: given a note and its context, output a delay that maximizes "human-likeness" while staying within scoring windows.
state space:
- current KPS (rolling average)
- note density in the next 500ms
- time since last input on this column
- current combo length
- position in the song (early/mid/late)
action space:
- continuous delay offset in [-30ms, +30ms]
- per-column bias adjustments
reward function:
```python
def reward(timing_error, human_distribution, combo_maintained):
    # penalize being outside the timing window
    if abs(timing_error) > TIMING_WINDOW:
        return -10.0

    # reward matching human timing distribution
    human_likelihood = human_distribution.pdf(timing_error)

    # small bonus for maintaining combo (don't miss)
    combo_bonus = 0.1 if combo_maintained else 0

    # penalize being TOO perfect (sus)
    perfection_penalty = -2.0 if abs(timing_error) < 1.0 else 0

    return human_likelihood + combo_bonus + perfection_penalty
```
the perfection_penalty was crucial. without it, the agent learned to hit perfect timings every time, which is exactly what gets you reported.
training data
for each replay in our dataset, we extracted:
- the timing offset of every hit (ground truth human behavior)
- the context at that moment (KPS, density, combo, etc)
this gave us ~2 million training samples across different skill levels, playstyles, and song difficulties.
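the preprocessing boiled down to building (state, offset) pairs per hit. a simplified sketch, with illustrative field names (not our actual pipeline):

```python
def build_samples(replay_hits, window_ms=500):
    """replay_hits: list of dicts with 'expected', 'actual', 'column' per hit,
    sorted by expected time. Returns (state, target_offset) training pairs."""
    samples = []
    last_hit_on_column = {}
    combo = 0
    song_end = replay_hits[-1]["expected"] or 1
    for i, hit in enumerate(replay_hits):
        t = hit["expected"]
        # note density in the next 500 ms, and rolling KPS over the last second
        density = sum(1 for h in replay_hits[i + 1:] if h["expected"] - t <= window_ms)
        kps = sum(1 for h in replay_hits[:i] if t - h["expected"] <= 1000)
        state = [
            kps,
            density,
            t - last_hit_on_column.get(hit["column"], t),  # time since last input on this column
            combo,
            t / song_end,                                   # position in the song
        ]
        offset = hit["actual"] - hit["expected"]            # ground-truth human delay
        samples.append((state, offset))
        last_hit_on_column[hit["column"]] = t
        combo = combo + 1 if abs(offset) <= 150 else 0
    return samples
```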
the model
model-wise it's nothing fancy: a small feedforward network (3 hidden layers, 128 units each) that outputs the mean and standard deviation of a gaussian distribution. we sample from that distribution to get the actual delay.
```python
import torch.nn as nn
import torch.nn.functional as F

class HumanizerNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(STATE_DIM, 128)
        self.fc2 = nn.Linear(128, 128)
        self.fc3 = nn.Linear(128, 128)
        self.mean_head = nn.Linear(128, ACTION_DIM)
        self.std_head = nn.Linear(128, ACTION_DIM)

    def forward(self, state):
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(x))
        x = F.relu(self.fc3(x))
        mean = self.mean_head(x)
        std = F.softplus(self.std_head(x)) + 0.1
        return mean, std
```
trained with PPO for ~500k steps. the loss curves were boring (good sign) and the final policy produced timing distributions that were statistically indistinguishable from our reference players.
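the update itself was a standard clipped-surrogate PPO step over the gaussian policy above. a compressed sketch of what one update looks like (not our training harness; it assumes advantages have already been computed from the reward function, and the hyperparameters are placeholders):

```python
import torch
from torch.distributions import Normal

def ppo_update(policy, optimizer, states, actions, advantages, old_log_probs,
               clip_eps=0.2, epochs=4):
    """One PPO update over a batch of (state, sampled delay) pairs."""
    for _ in range(epochs):
        mean, std = policy(states)
        dist = Normal(mean, std)
        log_probs = dist.log_prob(actions).sum(-1)
        ratio = torch.exp(log_probs - old_log_probs)
        # clipped surrogate objective: don't move the policy too far per step
        surr1 = ratio * advantages
        surr2 = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
        loss = -torch.min(surr1, surr2).mean() - 0.01 * dist.entropy().mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```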
validation
we generated 1000 synthetic replays and compared their timing distributions against real replays using the Kolmogorov-Smirnov test. p-values > 0.05 for all columns and KPS brackets.
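the check itself is basically a one-liner with scipy. a sketch (the file names are placeholders):

```python
import numpy as np
from scipy.stats import ks_2samp

# stand-in arrays; in practice these are the hit-timing errors (ms) extracted
# from one synthetic replay and one real replay, bucketed by column / KPS bracket
bot_errors = np.loadtxt("bot_errors_col1.txt")
human_errors = np.loadtxt("human_errors_col1.txt")

stat, p_value = ks_2samp(bot_errors, human_errors)

# p > 0.05: we can't reject "both samples came from the same distribution",
# i.e. the bot's errors are statistically indistinguishable from the player's
print(f"KS={stat:.3f}, p={p_value:.3f}, {'pass' if p_value > 0.05 else 'flagged'}")
```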
in english (for the non-math nerds out there): if you looked at a histogram of our bot's timing errors vs a real player's timing errors, you couldn't tell them apart.
here’s what it looks like in action:
making it universal
game 1 uses arrow skins. game 2 uses circles. game 3 uses bars. game 4 uses rectangles. who cares?
the core engine doesn’t change. the humanizer doesn’t change. the only thing that changes per-game is:
- detection calibration - different pixel coordinates, different note colors
- timing window tuning - some games are stricter than others
- input method - SendInput works for most, but some games need different hooks
we built a simple config system. point it at a new game, calibrate the detection regions, tweak the timing parameters, done. same bot, different skin.
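conceptually, a per-game config is just this kind of thing (a sketch, not our actual schema; the values below are the osu!mania ones from earlier):

```python
from dataclasses import dataclass

@dataclass
class GameConfig:
    name: str
    # 1x1 pixel sample points at the hit zone, one (x, y) per column
    columns: list[tuple[int, int]]
    tap_color: tuple[int, int, int]       # RGB of a tap note at the hit line
    hold_color: tuple[int, int, int]      # RGB of a long-note body
    color_tolerance: int = 3
    hit_window_ms: float = 16.0           # strictest judgement window
    input_backend: str = "sendinput"      # or a game-specific hook

OSU_MANIA_4K = GameConfig(
    name="osu!mania 4K",
    columns=[(733, 922), (884, 921), (1027, 929), (1182, 922)],
    tap_color=(148, 157, 253),
    hold_color=(230, 227, 228),
)
```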
tested on: osu!mania, etterna, quaver, stepmania, malody, even roblox, and a few others we probably shouldn’t name. worked on all of them.
yes, even roblox:
the climb
with a working bot and a humanizer that passed statistical tests, it was time to see how far we could push it.
we spent around $5,000 on accounts across various games: osu!, etterna, quaver, roblox, you name it.
then we set some rules:
- never claim #1 (too obvious)
- lose occasionally on purpose
- only play on alts
- build a persona (join discords, chat with people, be a person)
- gradual improvement (no overnight rank spikes)
the last one was the hardest. we couldn’t just jump from unranked to top 100 in a week. so we tried two different routes.
the first one: scripting a fake improvement curve. start with easier maps, gradually increase difficulty, occasionally plateau, sometimes regress slightly. just like a real player learning the game.
the second one: jumping straight to insane ranks and using our profiles in other games to prove our legitimacy, claiming we came from those games, since skills are transferable.
we even made the bot “practice” maps. play the same song multiple times with slightly improving scores each time. leave realistic gaps between sessions. take breaks on weekends.
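the "improvement curve" itself was just a target schedule the bot tried to hit. something in this spirit (the numbers are illustrative, not our actual parameters):

```python
import random

def improvement_schedule(sessions=60, start_acc=92.0, ceiling=99.2):
    """Generate a per-session target accuracy: mostly improving, with occasional
    plateaus and small regressions, like a real grinder learning a game."""
    targets, acc = [], start_acc
    for _ in range(sessions):
        roll = random.random()
        if roll < 0.15:
            step = -random.uniform(0.1, 0.4)   # bad day, slight regression
        elif roll < 0.40:
            step = 0.0                          # plateau
        else:
            # diminishing returns as you approach the ceiling
            step = random.uniform(0.05, 0.30) * (ceiling - acc) / (ceiling - start_acc)
        acc = min(ceiling, max(start_acc - 1.0, acc + step))
        targets.append(round(acc, 2))
    return targets
```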
and then we started climbing.
first it was top 10,000. then top 5,000. then top 1,000. every milestone felt surreal. we were watching a program outperform thousands of real humans who had spent years developing their skills.
top 500. top 200. top 100.
at this point, on some games like roblox, people started noticing. "what's your practice routine?" "can you stream sometime?"
we couldn’t stream, obviously. so we made excuses. bad internet. anxiety. broken webcam. the community bought it. they had no reason not to.
top 50. then top 20.
we were in the leaderboards next to players we’d watched for years. players whose replays we’d scraped to train our bot were now competing against that same bot. the irony wasn’t lost on us lol
top 100 players vouched for us. people analyzed our replays frame-by-frame and said “yeah i mean this looks legit”. at some point on quaver someone even invited us to a tournament. we politely declined (scheduling conflicts, yk…)
we were living a double life. and honestly? it was kind of exhausting.
how we got caught (sort of)
here’s the thing: we didn’t really get caught. not in the way you’d expect.
out of all the accounts across all the games, only two osu! accounts got banned. the rest? still standing. etterna, quaver, the roblox games? all still up. we’ll be manually deleting them now that this post is out, since that was always the plan once the experiment was over.
so what happened with those two osu! accounts?
we got cocky. we started sniping top players. taking their #1s on maps they cared about. that’s when people started paying attention.
the thing is, we had a cover story ready: “i’m coming from etterna/quaver, just switched to osu!mania.” this is actually a legitimate thing that happens, as players migrate between VSRGs all the time, and their skills transfer. it’s not suspicious to be good at osu!mania if you’ve been grinding etterna for years.
we even got top players to vouch for us. showed them our replays, chatted with them about technique, talked about our “background” in other games. claimed we were specifically good at X or Y playstyle, some analyzed our scores and said “yeah this tracks, the skillset transfers”… social engineering at its finest.
since both of us actually knew how to play VSRGs, we could handle conversations with top players just fine without struggling to use terminology or knowing things only good players would know.
and here’s our strongest argument: we couldn’t possibly be cheating in EVERY game.
the community’s logic was simple, there are no publicly available cheats that work across osu!, etterna, quaver, AND roblox rhythm games. surely if someone was cheating, they’d only be cheating in one game? the idea of a universal VSRG cheat seemed absurd. “there’s no way someone has a cheat for all of them”
turns out we did. but nobody believed that was even possible.
the irony of this screenshot still kills us. “i just so happen to have a cheat that works on every game including quaver etterna osu roblox” said sarcastically, as if that’s an absurd claim. reader, it was not absurd. we had exactly that.
but here’s the thing about sniping top players: they get salty. and salty players report. a lot.
the anti-cheat never flagged us. our replays passed every statistical test the community threw at them. the timing distributions looked human. the persona was solid.
what got us was the cockiness, which led to reports. not reports backed by evidence (every piece of evidence they provided got disproven by other top players, graphs included!), just "this doesn't feel right" from enough people that staff had to do something.
the official ban reason? multi-accounting and stolen account suspicion. not cheating. they couldn’t prove cheating because, statistically, there was nothing to prove. our replays were indistinguishable from human replays.
they knew something was off. too good, too fast, too many people complaining. but they couldn’t prove what.
in the end, the cheat was never detected. the human behind it was just too annoying.
and honestly? that felt like a win.
what we learned
- rhythm games are a solved problem - mathematically, there's nothing stopping a bot from achieving perfect play. the timing is deterministic, the inputs are simple.
- the hard part is the human problem - anti-cheat software is easy to bypass. the community is not. thousands of players actively hunting for cheaters is a more effective deterrent than any automated system.
- statistical humanization is surprisingly effective - with enough training data and the right model, you can generate behavior that's indistinguishable from human behavior. scary implications beyond rhythm games.
- social engineering is underrated - half our success came from building a believable persona in various games. the technical cheat was good, but the social cover was what let us climb for months.
- we spent way too much time on this - could've learned a real skill. instead we learned how to fake one. worth it? probably not. fun? absolutely.
why we’re writing this
we’re not releasing code and we’re not here to help people cheat.
this is documentation of an interesting technical problem! the intersection of signal processing, reinforcement learning, game hacking, and social engineering.
the rhythm game community will probably hate this post. fair honestly. and we’re sorry. but the problems we solved here have applications beyond cheating at video games: fraud detection, bot detection, behavioral biometrics, and much more! understanding how to fake human behavior helps you understand how to detect fake human behavior.
also we’re already banned and are going to delete all our other accounts used for testing. what are they gonna do, ban us again?
closing thoughts
we built something that shouldn’t work as well as it did. we climbed higher than we expected. we made friends who weren’t real and rivals who didn’t know they were competing against code.
in the end, we got caught not because the cheat failed, but because humans are better at detecting humans than software ever will be. the bot was perfect. the person it was pretending to be wasn’t convincing enough.
there’s probably a lesson in there somewhere.
thanks for reading. don’t cheat at video games. it’s not worth it.
but if you’re going to do it anyway, at least do your own research, write your own stuff, and make it interesting