Adding basic structure for the FastAPI server and command line #2

Draft · wants to merge 14 commits into master
59 changes: 59 additions & 0 deletions CONTRIBUTION.md
# ADALA Project Contribution Guide: Agent and Skill Development

Thank you for your interest in contributing to the ADALA Project's agent development! The robustness and versatility of our system primarily stem from the diverse agents and skills we deploy. This guide focuses on agent-related contributions, highlighting the importance of domain and task specificity.

## Areas of Contribution:

### Diverse Skills Contributions:

ADALA welcomes agents equipped with a wide range of skills, each offering unique capabilities. From tasks such as classification, anomaly detection, and regression to specialized roles like sentiment analysis or recommendation systems, there's endless potential to broaden our agent spectrum. Skills designed for specific domains (like medical, finance, or nature) or tailored tasks within these areas can considerably amplify the system's efficacy.

### Extending Skills:

Start with the foundational `Skill` class and extend it to teach ADALA new skills. For reference, examine how the Classification or NamedEntity skills are implemented.

Example: a minimal sketch of a skill subclass. The import path, class attributes, and `apply` signature below are assumptions for illustration; check the actual `Skill` base class in the repository for the required interface.
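```python
from adala.skills import Skill  # assumed import path


class SentimentSkill(Skill):
    """A hypothetical skill that labels the sentiment of a text."""

    name = 'sentiment'
    instruction = 'Label the sentiment of the given text.'
    labels = ['positive', 'negative']

    def apply(self, text: str) -> str:
        # A real skill would delegate to the configured runtime here;
        # `self.runtime` is assumed to be supplied by the base class.
        return self.runtime.predict(self.instruction, text, self.labels)
```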

### Domain-Specific Skills

Customize skills for particular domains to provide deeper insights and more actionable feedback.

Example: building on the hypothetical `SentimentSkill` sketch above, a domain-specific variant might only narrow the instruction and labels to its field.
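```python
class FinancialSentimentSkill(SentimentSkill):
    """A hypothetical finance-domain skill: same mechanics as the
    SentimentSkill sketch above, but with a domain-tuned instruction
    and label set."""

    name = 'financial_sentiment'
    instruction = (
        'Label the sentiment of the given financial news headline '
        'with respect to the mentioned company.'
    )
    labels = ['bullish', 'bearish', 'neutral']
```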

#### Guidelines for New Skills:

- Uniqueness: Focus on specificity. What unique problem does your skill resolve?
- Integration: Ensure your skill aligns well with the existing ADALA framework.
- Documentation: Offer comprehensive documentation, usage examples for your agent, and a testing environment (with a ground-truth dataset).
- Testing: Incorporate both unit and integration tests to ensure seamless integration with the ADALA system (see the sketch below).
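
As an illustration of the testing guideline, here is a minimal unit-test sketch. It assumes the hypothetical `SentimentSkill` from the extension example above and stubs out the runtime, so no real LLM is called:

```python
class StubRuntime:
    """A fake runtime returning a fixed label, so the test exercises
    the skill's logic without calling a real LLM."""

    def predict(self, instruction, text, labels):
        return 'positive'


def test_sentiment_skill_returns_known_label():
    skill = SentimentSkill()       # hypothetical skill from the sketch above
    skill.runtime = StubRuntime()  # assumed attribute name
    assert skill.apply('I love this!') in skill.labels
```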

### New Runtimes

Introduce new runtimes that use different language models, or even entirely different model types, for labeling tasks. Improving the current implementations through performance optimization or new features is also encouraged.

#### Adding a New Runtime:
To introduce a new runtime, follow the structure defined by the `Runtime` abstract class. Below is a rudimentary sketch; the import path and the `predict` signature are assumptions, so consult the actual abstract class for the required interface.
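```python
from adala.runtimes import Runtime  # assumed import path


class EchoRuntime(Runtime):
    """A toy runtime that always returns the first label, useful only
    to illustrate the expected structure, not for real labeling."""

    def predict(self, instruction: str, text: str, labels: list[str]) -> str:
        # A production runtime would call an LLM (or another model) here.
        return labels[0]
```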

## How to Contribute:

- Fork the Repository: Create a fork of the ADALA repository on your GitHub account.
- Clone, Branch, and Develop: Clone your fork, create a new branch for your contribution, and start developing.
- Test and Commit: After making your changes, test them thoroughly. Once satisfied, commit with an informative message.
- Push and Pull Request: Push your changes and open a pull request detailing your contribution's value.

## Code of Conduct:
While diverse contributions invigorate our project, it's paramount to sustain a harmonious and cooperative environment. Please adhere to our code of conduct.

## Questions or Discussions:
For inquiries or discussions concerning particular features, agents, or modifications, please initiate an issue. Your feedback propels the project's advancement.
165 changes: 163 additions & 2 deletions README.md
# ADALA <img src="https://htx-pub.s3.amazonaws.com/samples/Adala.png" width="100" align="right"/>
Automated Data Labeling Framework.

[![PyPI version](https://badge.fury.io/py/adala.svg)](https://badge.fury.io/py/adala)
[![Python version](https://img.shields.io/pypi/pyversions/adala.svg)](https://pypi.python.org/pypi/adala)
[![License](https://img.shields.io/pypi/l/adala.svg)](https://pypi.python.org/pypi/adala)


ADALA is a framework for automated data labeling. It uses a combination of autonomous Large Language Model (LLM) agents and Active Learning (AL) to label data. It is designed to be used with [Label Studio](https://labelstud.io/) to provide a human-in-the-loop data labeling experience.

Here is what ADALA does:
- [LLM instructions generation](#llm-instructions-generation)
- [Predicting dataset with LLM](#predicting-dataset-with-llm)
- [Active learning with Human-in-the-Loop](#active-learning-with-human-in-the-loop)
- [LLM uncertainty estimation](#llm-uncertainty-estimation)


## Installation

Install ADALA:
```bash
git clone https://github.com/HumanSignal/ADALA.git
cd ADALA/
pip install -e .
```

If you're planning to use human-in-the-loop labeling, install Label Studio:
```bash
pip install label-studio
```

## Load dataset
ADALA works with datasets in various formats:
- [Pandas DataFrame](#pandas-dataframe)
- [Spark DataFrame](#spark-dataframe)

### Pandas DataFrame

Load the data into a pandas DataFrame:
```python
import pandas as pd
input_df = pd.read_csv('dataset.csv')
```

### Spark DataFrame

```python
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
input_df = spark.read.csv('dataset.csv', header=True)  # load the dataset into a Spark DataFrame
```


## Predicting dataset with LLM

ADALA inference is optimized to run in batch mode: predicting the whole dataset at once is much faster than labeling it row by row.

Create LLM labeler:

```python
import adala as ad

labeler = ad.OpenAILabeler(model_name='gpt-4')
labeler.label_string('The sun is white.', instruction='Is it true?', labels=['yes', 'no'])
```

Multiple LLM backends are available, listed in the table below:
| Model | Initialize predictor |
| -------- | ------- |
| [OpenAI API](https://platform.openai.com/) | `ad.OpenAILabeler()` |
| [Any LangChain's LLM](https://python.langchain.com/docs/get_started/introduction.html) | `ad.LangChainLLMPredictor()` |
| [HuggingFace TGI](https://huggingface.co/text-generation-inference) | `ad.HuggingFaceLLMPredictor()` |
| [vLLM](https://vllm.ai/) | `ad.VLLMPredictor()` |
| [llama.cpp](https://github.com/ggerganov/llama.cpp) | `ad.LlamaCppPredictor()` |


Predict the whole dataset:
```python
labeled_df = labeler.label(
    df=input_df,
    instruction='Predict sentiment',
    labels=['positive', 'negative'],
    output_column='predictions'
)
labeled_df['predictions']
```


## LLM instructions generation

ADALA can generate optimal LLM instructions for data labeling. You need to have an [OpenAI API](https://platform.openai.com/) key to use ADALA.

```bash
export OPENAI_API_KEY=your_key
```

The following method allows you to finetune instructions to classify each row in the DataFrame, given the ground truth labels in the specified column:
```python
import adala as ad

result = ad.generate_instructions(
    labeler=labeler,
    df=input_df,
    ground_truth_column='label'
)
```

Now you can use the generated instructions to label the dataset with LLM:
```python
labeled_df = labeler.label(
    df=input_df,
    instruction=result.best_instruction,
    labels=result.labels,
    output_column='predictions'
)
```


## Active learning with Human-in-the-Loop

Combining instructions generation and dataset prediction, ADALA can be used to create a human-in-the-loop automated data labeling experience with Label Studio.

First [create a Label Studio project](https://labelstud.io/guide/setup_project).

> Note: Currently ADALA is designed to work with Text Classification projects. Go to `Labeling Setup > Natural Language Processing > Text Classification`. Change label names to match your dataset labels.

Get the project ID `project_id` from the URL; it will be used later.

Setup environment variables with [Label Studio API token](https://labelstud.io/guide/api#Authenticate-to-the-API) and Label Studio host:
```bash
export LABEL_STUDIO_API_TOKEN=your_token
export LABEL_STUDIO_HOST=http://localhost:8080
```

Generate LLM instructions with human-in-the-loop labeling:

```python
import adala as ad

labeled_df = ad.generate_instructions(
    df=input_df,
    # ... other parameters
    human_in_the_loop=True,
    label_studio_project_id=project_id,
    # use your Label Studio API token and host if not set as environment variables
    label_studio_api_token='your_token',
    label_studio_host='your_host'
)
labeled_df['predictions']
```

## LLM uncertainty estimation

ADALA can be used to estimate LLM uncertainty for each row in the dataset. It is useful if you want to detect hallucinations or other forms of LLM errors.

```python
uncertainty_df = ad.estimate_uncertainty(
    df=labeled_df,
    instructions=result.best_instruction,  # the instructions used to produce the predictions
    prediction_column='predictions',
    output_column='uncertainty'
)
uncertainty_df['uncertainty']
```