Default-on Since April 24: GitHub Trains Copilot on User Code

Part 1 of a three-part series on GitHub alternatives.

Since April 24, 2026, GitHub Copilot has been operating under new house rules. Anyone on a Free, Pro or Pro+ plan now contributes their inputs to AI training by default. Anyone who doesn’t want that has to actively opt out. Business customers are exempt — and that’s the actual story.

Defaults are the most powerful form of behavioral steering because they work invisibly. No one clicks them away, because no one sees them. GitHub relied on exactly this property when it published the announcement on March 26 and gave the market four weeks to get used to it. The switch was flipped on April 24. Anyone who leaves the settings at /settings/copilot/features untouched now sends code context around the cursor position, prompts, file names, navigation patterns and comments to the platform’s model pipeline — a fairly complete picture of what a developer is currently working on.

The legal staging is clean: timely announcement, documented privacy FAQ, clearly labeled opt-out switch, even a grandfather rule under which earlier preferences are honored. Anyone who had previously deactivated the more general “use my code snippets to improve products” switch keeps that preference. What still makes this change a case study is not a procedural error. It’s the underlying construction: default-on for individual users, default-off for paying business customers.

Who Gets What

The dividing line runs between tiers. Copilot Free, Pro and Pro+ — the individual accounts — give up interaction data. Copilot Business and Enterprise do not. According to the FAQ, the data may be shared “with GitHub affiliates,” which explicitly includes Microsoft as the parent company; OpenAI and Anthropic are not supposed to receive it. So much for the official position.

What stands out isn’t what happens in detail, but what the segmentation itself says. With it, Microsoft acknowledges two things. First, that the processing is sensitive enough to shield business customers from it — if it were inconsequential, business tiers wouldn’t need to be exempt. The existence of the exception is the admission. Second, that the individual user layer is factored in as training material. Not incidentally, not as a side effect, but as a platform decision. Anyone working privately on a hobby project, side gig or OSS contribution with a Pro account is, by default, supplying training data for the models that are then marketed to business customers.

For German mid-market developers the situation is particularly uncomfortable. Anyone who works in the office with Copilot Business and at home with Copilot Pro — a more common setup than one might think — lives between two worlds with different default assumptions about data flow. The mental separation is left to the developer.

How the Switch Became What It Is

The backstory explains a lot. Copilot launched as a public preview in June 2021, with the underlying OpenAI Codex model trained on public GitHub code without systematic regard for license terms like GPL or Apache 2.0. November 2022 brought the Doe et al. v. GitHub lawsuit, parts of which are still active. In 2023 GitHub clarified that private repositories would not be drawn upon for Copilot training — a self-commitment, not an enforceable promise, and one that concerned the training dataset, not every processing layer.

Between 2023 and 2025 the general “use my code snippets to improve products” switch was in place. What fell under “improve” was defined by GitHub: bug triage, telemetry aggregation, ML models for search ranking, prompt tuning — none of it explicitly excluded. The step now under discussion finally came in March 2026. The vague improvement clause turned into an explicit toggle that calls AI training by its name, with a clear default-on position for Free, Pro and Pro+.

The trajectory is recognizable: first the vague promise, then the explicit processing called by its real name. Legally this is cleaner than the old solution, because consent is at least formally informed. But it’s also a signal that Microsoft now wants to declare data usage openly instead of subsuming it under “service improvement.” An implicit practice becomes a declared one.

What Default-on Means Legally

Default-on for sensitive processing is not unproblematic. The GDPR requires a “clear affirmative act” for consent (Art. 4 No. 11), and the CJEU clarified in Planet49 (2019) that pre-ticked boxes do not constitute valid consent. GitHub argues that the opt-out switch is not GDPR consent but a contractual usage rule based on legitimate interest or contract performance. The construction is possible but contestable. When a processing purpose is framed as broadly as “training data for AI models, shared with the Microsoft corporate group,” the provider’s legitimate interest becomes a blanket authorization.

The dispute will be fought out once the first data protection authorities take action. The lead authority for Microsoft is the Irish DPC, which has historically been slow to move proceedings forward. Until a final ruling — realistically 2028 at the earliest — the default-on state applies to everyone who doesn’t actively change it.

From a compliance perspective, the situation needs to be reassessed for companies that allow employees to work with private GitHub accounts. Anyone writing code for a corporate OSS project from home with a private Pro account is now sending interaction data to GitHub that may be shared with Microsoft. Whether the records of processing activities and the DPIA cover that is a question that data protection officers will need to answer very quickly.

The Convenient Reflex and the More Honest Reasoning

One clarification belongs here for honesty’s sake. The US-hosting argument — CLOUD Act, Schrems II, FISA 702 — is real and legally serious. But as the sole trigger for migration it only carries so far, at least for companies that already run AWS, Azure, Cloudflare and similar US services in their infrastructure. Anyone routing their web traffic through Cloudflare and running their data pipelines on AWS cannot consistently argue that GitHub specifically is untenable because of its US base.

The more honest reason is a different one. Code is more sensitive than web traffic or object storage. Code is an active business asset, often a trade secret, sometimes decisive for competitiveness. AWS stores bytes; Cloudflare terminates TLS. GitHub, since April 24, holds the right to use code inputs for model training — provided the account is on the Free, Pro or Pro+ tier and hasn’t actively objected. That’s a different class of trust decision.

The data protection aspect remains relevant, but as an amplifier of a more specific argument. Default-on for training use is the primary boundary line. The CLOUD Act and similar third-country issues are a secondary factor — they sharpen the consequences of a default-on decision, because data can then end up not just within a provider’s corporate group, but also within reach of state agencies.

Three Levers in the Toolkit

Three legal instruments are worth knowing because they apply even when the platform stays American.

The first is the TDM reservation under Art. 4 DSM Directive / § 44b German Copyright Act. EU law allows text and data mining for commercial purposes only if the rights holder has not declared a machine-readable reservation. Under the prevailing interpretation, a prose clause in the README is not sufficient. In concrete terms that means HTTP headers, robots.txt entries, ideally an ai.txt. More on this in Part 2.
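
What “machine-readable” can look like in practice, as a minimal sketch: the header follows the W3C community draft for the TDM Reservation Protocol, and the crawler names in the robots.txt fragment are illustrative examples, not an exhaustive list.

```
# HTTP response header (TDM Reservation Protocol, W3C community draft)
tdm-reservation: 1

# robots.txt entries for known AI training crawlers (names illustrative)
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /
```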

The second is the EU AI Act, in force since August 2024. Art. 53 obliges providers of so-called general-purpose AI models to provide training-data transparency and to respect TDM reservations. The sanctions regime kicks in from August 2026, four months after the GitHub change. Compliance with TDM reservations thus becomes relevant not only under copyright law but also under product law.

The third is the GDPR via metadata. Code itself is usually not personal data, but commit metadata and interaction data certainly are. Name, email, IP, timestamp, cursor position over time — that’s Art. 4 material. A provider using this data for its own purposes, such as model training, needs its own legal basis; a data processing agreement is not enough.
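
How much of this sits in plain view is easy to check against any repository; the following one-liner uses standard git log format placeholders to print author name, email address and exact timestamp for every commit. IP addresses and interaction data come on top, held on the platform side.

```
# Print author name, email and ISO 8601 timestamp for every commit
git log --format='%an <%ae> %aI'
```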

The three levers don’t add up to a complete package, but to a toolkit with which the default-on mechanism can at least be selectively neutralized.

From the Switch to the Platform Question

The change of April 24, 2026 is an occasion to think about the structural defaults of the platform — not just about this one switch. Commercial competitors like GitLab, Bitbucket or Azure DevOps have structurally similar mechanisms in place, with different defaults and varying degrees of transparency in their AI integrations.

Part 2 introduces the alternatives: Codeberg and Forgejo as the most mature default-off platforms today, GitLab CE as a self-hosting variant with a guaranteed deactivated AI layer, plus Sourcehut, Gogs, Launchpad — and, as a contrast, AWS CodeCommit, which stopped accepting new customers in July 2024. Part 3 turns operational.


Translated with the help of Claude.
