Description
AutoFeat is an open-source automatic approach for feature discovery on tabular datasets.
Given a base table with a target variable and a repository of tabular datasets, AutoFeat helps discover features in the repository's tables that are relevant for augmenting the base table. The resulting augmented table is intended to be a better training dataset for decision-tree-based Machine Learning (ML) algorithms.
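As a minimal illustration of what augmentation means here, the sketch below left-joins a hypothetical candidate table onto a base table so that every base row is kept. The table contents, column names, and the `left_join` helper are assumptions made for illustration; they are not part of AutoFeat's API.

```python
# Hypothetical toy data: table and column names are illustrative only.
base_table = [
    {"id": 1, "age": 34, "target": 1},
    {"id": 2, "age": 51, "target": 0},
    {"id": 3, "age": 27, "target": 1},
]
candidate_table = [
    {"id": 1, "income": 42000},
    {"id": 3, "income": 38000},
]


def left_join(base, other, key):
    """Add the columns of `other` to `base`, keeping every base row."""
    lookup = {row[key]: row for row in other}
    new_columns = [c for c in other[0] if c != key]
    augmented = []
    for row in base:
        match = lookup.get(row[key], {})
        # Rows without a join partner get None for the new columns.
        extra = {c: match.get(c) for c in new_columns}
        augmented.append({**row, **extra})
    return augmented


augmented = left_join(base_table, candidate_table, "id")
for row in augmented:
    print(row)
```

The base table keeps its three rows and its target column; the row with `id` 2 has no match in the candidate table, so its `income` value is missing (`None`).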
Authors
All five authors are affiliated with TU Delft.
AutoFeat Methods
- Dataset Discovery: AutoFeat uses Valentine to discover joinable tables.
- Graph Traversal: AutoFeat uses Breadth-First Search to traverse the graph of connections, which helps manage error propagation.
- Streaming Feature Selection: AutoFeat uses streaming feature selection to navigate the space of joinable tables and select the relevant features for augmentation.
- Relevance: AutoFeat measures the relevance of features using Pearson correlation.
- Redundancy: AutoFeat removes redundant features using the Minimum Redundancy Maximum Relevance (MRMR) algorithm.
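The methods above can be sketched together in a simplified form. The example below is not AutoFeat's implementation: it assumes the join graph has already been discovered, that each table's features are already aligned with the base table's rows, and it uses a greedy MRMR-style score (relevance minus mean redundancy with already selected features). All names (`bfs_order`, `select_features`, `min_relevance`, the toy tables) are hypothetical.

```python
from collections import deque
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def bfs_order(join_graph, base):
    """Tables reachable from `base`, in breadth-first order."""
    seen, queue, order = {base}, deque([base]), []
    while queue:
        table = queue.popleft()
        for neighbour in join_graph.get(table, []):
            if neighbour not in seen:
                seen.add(neighbour)
                order.append(neighbour)
                queue.append(neighbour)
    return order


def select_features(order, features, target, min_relevance=0.3):
    """Process features table by table as they arrive (streaming style)."""
    selected = {}
    for table in order:
        for name, values in features[table].items():
            relevance = abs(pearson(values, target))
            if relevance < min_relevance:
                continue  # not relevant enough to the target
            redundancy = 0.0
            if selected:
                redundancy = sum(
                    abs(pearson(values, kept)) for kept in selected.values()
                ) / len(selected)
            if relevance - redundancy > 0:  # MRMR-style difference score
                selected[name] = values
    return list(selected)


# Hypothetical join graph and features aligned with the base table rows.
target = [1, 0, 1, 0, 1, 0]
join_graph = {"base": ["A", "B"], "A": ["C"], "B": [], "C": []}
features = {
    "A": {"a_signal": [0.9, 0.1, 0.8, 0.2, 0.7, 0.3]},
    "B": {"b_copy": [1.8, 0.2, 1.6, 0.4, 1.4, 0.6]},  # rescaled a_signal
    "C": {"c_noise": [5, 5, 1, 1, 3, 3]},
}

order = bfs_order(join_graph, "base")
chosen = select_features(order, features, target)
print(order)   # ['A', 'B', 'C']
print(chosen)  # ['a_signal']
```

In this toy run, `b_copy` is dropped as redundant because it is a rescaled copy of `a_signal`, and `c_noise` is dropped because it is uncorrelated with the target. Visiting tables breadth-first means directly joinable tables are evaluated before tables that are only reachable through longer join paths.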
Datasets
Dataset Source | # Rows | Processing strategy | # Joinable Tables | # Total Features | Links |
---|---|---|---|---|---|
jannis | 57581 | short_reverse_correlation | 12 | 55 | processed data |
miniboone | 73000 | short_reverse_correlation | 15 | 51 | processed data |
covertype | 423682 | short_reverse_correlation | 12 | 21 | processed data |
eyemove | 7609 | short_reverse_correlation | 6 | 24 | processed data |
credit | 1001 | short_reverse_correlation | 5 | 21 | processed data |
bioresponse | 3435 | short_reverse_correlation | 40 | 420 | processed data |
steel | 1943 | short_reverse_correlation | 15 | 34 | processed data |
school | 1775 | None | 16 | 731 | original data |
Repositories
- https://github.com/delftdata/autofeat : Main repository containing the AutoFeat source code.
- https://github.com/kirilvasilev16/PythonTableDivider : Repository containing the dataset processing strategies.
- https://github.com/delftdata/bsc_research_project_q4_2023/tree/main/autofeat_experimental_analysis : Repository containing the evaluation of relevance and redundancy methods.
AutoFeat Papers
- [Pre-print] AutoFeat: Transitive Feature Discovery over Join Paths