NVIDIA Unveils Nemotron-CC: A Trillion-Token Dataset for Enhanced LLM Training
By: cryptosheadlines|2025/05/08 12:00:08
Joerg Hiller, May 07, 2025 15:38

NVIDIA introduces Nemotron-CC, a trillion-token dataset for large language models, integrated with NeMo Curator. The pipeline is designed to optimize both data quality and quantity for AI model training.

NVIDIA has integrated its Nemotron-CC pipeline into NeMo Curator, offering a new approach to curating high-quality datasets for large language models (LLMs). The Nemotron-CC dataset comprises 6.3 trillion tokens of English-language text drawn from Common Crawl and aims to significantly improve the accuracy of LLMs, according to NVIDIA.

Advancements in Data Curation

The Nemotron-CC pipeline addresses a limitation of traditional data curation methods, which often discard potentially useful data through heuristic filtering. By employing classifier ensembling and synthetic data rephrasing, the pipeline generates 2 trillion tokens of high-quality synthetic data, recovering up to 90% of content lost by filtering.

Innovative Pipeline Features

The data curation process begins with HTML-to-text extraction using tools like jusText, followed by language identification with FastText. It then applies deduplication to remove redundant data, using NVIDIA RAPIDS libraries for efficient processing. The process includes 28 heuristic filters to ensure data quality and a PerplexityFilter module for further refinement.

Quality labeling is achieved through an ensemble of classifiers that assess documents and sort them into quality levels, which makes targeted synthetic data generation possible. This approach enables the creation of diverse QA pairs, distilled content, and organized knowledge lists from the text.

Impact on LLM Training

Training LLMs with the Nemotron-CC dataset yields significant improvements.
For instance, a Llama 3.1 model trained on a 1-trillion-token subset of Nemotron-CC achieved a 5.6-point increase in MMLU score compared to models trained on traditional datasets. Furthermore, models trained over a long token horizon on data that includes Nemotron-CC saw a 5-point boost in benchmark scores.

Getting Started with Nemotron-CC

The Nemotron-CC pipeline is available to developers who want to pretrain foundation models or perform domain-adaptive pretraining across various fields. NVIDIA provides a step-by-step tutorial and APIs for customization, so users can tune the pipeline to their specific needs. The integration into NeMo Curator supports the development of both pretraining and fine-tuning datasets.

For more information, visit the NVIDIA blog.
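
To make the curation stages described in this article more concrete, here is a minimal Python sketch of three of them: exact deduplication, a heuristic filter, and ensemble-based quality labeling. It is an illustration only and does not use the NeMo Curator API. All function names, the two toy scorers, the word-count threshold, the choice to combine scores by taking the maximum, and the three quality levels are assumptions made for this example rather than details confirmed by the article.

```python
import hashlib
from typing import Callable, Iterable, List, Tuple

# A scorer maps a document to a quality score in [0, 1] (hypothetical interface).
Scorer = Callable[[str], float]


def exact_dedup(docs: Iterable[str]) -> List[str]:
    """Drop exact duplicates after lowercasing and collapsing whitespace."""
    seen = set()
    unique = []
    for doc in docs:
        normalized = " ".join(doc.lower().split())
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique


def passes_heuristics(text: str, min_words: int = 5) -> bool:
    """Stand-in for the heuristic filters: require a minimum word count."""
    return len(text.split()) >= min_words


def length_scorer(text: str) -> float:
    """Toy scorer: longer documents score higher, capped at 1.0."""
    return min(len(text.split()) / 20.0, 1.0)


def vocabulary_scorer(text: str) -> float:
    """Toy scorer: fraction of distinct words in the document."""
    words = text.lower().split()
    return len(set(words)) / len(words) if words else 0.0


def ensemble_label(text: str, scorers: List[Scorer], n_levels: int = 3) -> int:
    """Combine scorers by maximum (an assumption), then bucket into levels.

    Level 0 is the lowest quality and n_levels - 1 the highest.
    """
    best = max(scorer(text) for scorer in scorers)
    best = min(max(best, 0.0), 1.0)  # clamp to [0, 1]
    return min(int(best * n_levels), n_levels - 1)


def curate(docs: Iterable[str]) -> List[Tuple[str, int]]:
    """Run dedup, then filtering, then quality labeling."""
    scorers = [length_scorer, vocabulary_scorer]
    kept = [d for d in exact_dedup(docs) if passes_heuristics(d)]
    return [(d, ensemble_label(d, scorers)) for d in kept]


if __name__ == "__main__":
    sample = [
        "the cat sat on the mat today",
        "The cat  sat on the mat today",  # duplicate after normalization
        "too short",                      # removed by the heuristic filter
        "spam spam spam spam spam spam",  # kept, but labeled low quality
    ]
    for doc, level in curate(sample):
        print(level, doc)
```

In a pipeline of the kind the article describes, the quality level attached to each document could then decide which synthetic data strategy is applied to it, for example rephrasing lower-quality text or generating QA pairs from higher-quality text.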
