Hugging Face, the machine learning community and AI tools platform, announced the release of HuggingChat, an open source ChatGPT clone that anyone can use or download for themselves.
What is Hugging Face?
Hugging Face is a company and an AI community. It provides access to free open source tools for developing machine learning and AI apps.
One of Hugging Face’s recently completed projects is BLOOM, a 176 billion parameter large language model that is available to anyone who agrees to abide by its Responsible AI License.
The platform provides access to open source models in categories such as multimodal, vision, audio, natural language processing, and reinforcement learning.
Hugging Face also hosts open source datasets and libraries and serves as a way for teams to collaborate, including a repository, similar to GitHub.
Many of the services are free, with pro and enterprise tiers also available.
Who Founded Hugging Face?
Hugging Face was established in 2016 by Clément Delangue, Julien Chaumond, and Thomas Wolf with the intention of creating a chatbot app for teenagers. But the company’s direction shifted after they open-sourced the chatbot model, and they focused on creating a platform for machine learning. In March 2021, Hugging Face successfully raised $40 million in a Series B funding round.
What To Know About HuggingChat
The HuggingChat ChatGPT clone is based on the Open Assistant Conversational AI Model. Accessing HuggingChat is quick and straightforward – just visit HuggingFace.co/Chat and you’re ready to chat.
Open Assistant itself is a project of the non-profit Large-scale Artificial Intelligence Open Network (LAION).
LAION is a global non-profit organization dedicated to providing access to cutting-edge technology as open source. The organization states:

“We believe that machine learning research and its applications have the potential to have huge positive impacts on our world and therefore should be democratized.
OUR PRINCIPAL GOALS
Releasing open datasets, code and machine learning models.
We want to teach the basics of large-scale ML research and data management.
By making models, datasets and code reusable without the need to train from scratch all the time, we want to promote an efficient use of energy and computing resources to face the challenges of climate change.”
The GitHub page for the Open Assistant chat model says:
“Open Assistant is a project meant to give everyone access to a great chat based large language model.
We believe that by doing this we will create a revolution in innovation in language.
In the same way that stable-diffusion helped the world make art and images in new ways we hope Open Assistant can help improve the world by improving language itself.”
HuggingChat Training Dataset
HuggingChat was trained with the OpenAssistant Conversations Dataset (OASST1), which is very new, containing data collected through April 12, 2023.
The research paper for the dataset dates from April 2023 (OpenAssistant Conversations – Democratizing Large Language Model Alignment – PDF).
The model uses the same training methodology popularized by OpenAI: reinforcement learning from human feedback (RLHF).
RLHF is a technique in which a model is fine-tuned on a high quality, human-annotated and quality-rated dataset of questions and answers, so that it learns to follow directions.
With this release, the project puts the RLHF technique within reach of anyone who wants to train an AI.
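At the heart of RLHF is a reward model trained to score the human-preferred answer higher than the rejected one, typically with a pairwise ranking loss. Here is a minimal, self-contained sketch of that objective in plain Python; the function name and example scores are illustrative, not taken from the Open Assistant codebase:

```python
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise ranking loss used in RLHF reward modeling:
    -log(sigmoid(r_chosen - r_rejected)).
    The loss shrinks as the reward model scores the
    human-preferred answer higher than the rejected one."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# When the preferred answer already scores higher, the loss is small...
low = preference_loss(2.0, -1.0)
# ...and when the ranking is inverted, the loss is large.
high = preference_loss(-1.0, 2.0)
print(low < high)  # True
```

Minimizing this loss over many human-rated answer pairs is what teaches the reward model the raters’ preferences; the chat model is then tuned to maximize that learned reward.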
The research paper stated:
“In an effort to democratize research on large-scale alignment, we release OpenAssistant Conversations, a human-generated, human-annotated assistant-style conversation corpus consisting of 161,443 messages distributed across 66,497 conversation trees, in 35 different languages, annotated with 461,292 quality ratings.”
The dataset is the product of a worldwide crowdsourcing effort by over 13,000 volunteers.
Crowdsourcing was a good way to generate multilingual training data, which contributed to the quality of the dataset.
However, according to the researchers, the crowdsourcing approach also introduced limitations in the quality of the dataset in the form of cultural and subjective biases of the individuals who created and rated the training data.
They also warned that participants who were more engaged tended to contribute more, thus creating an uneven distribution of their values and biases.
The researchers conclude that the dataset may not represent the diversity of viewpoints across all the contributors.
For example, they sent out a survey to their Discord channel (in English only) asking their open source contributors questions related to their demographics (but not ethnicity).
Setting aside the language bias, the results of the survey revealed that out of the 226 respondents, 201 were male, 10 were female, five identified as non-binary/other and 10 declined to answer.
Nevertheless, although the researchers cannot guarantee that the dataset is entirely free of harmful content, they stand behind it because it was created under strict quality guidelines.
The researchers write:
“To ensure the quality of our dataset, we have established strict contributor guidelines that all users must follow.
These guidelines are designed to prevent harmful content from being added to our dataset, and to encourage contributors to generate high-quality responses.”
HuggingChat Is Available
HuggingChat is open for users right now. No registration or login account is required to use it.
Don’t expect ChatGPT-level output; the service is not at that level yet. The app page lists it as version 0.0, which gives an idea of how mature it is at this point.
Nevertheless, it’s a remarkable first step for the open source community, and there is absolutely no charge to use it.
Visit the HuggingChat webpage here: