28M Hacker News comments as vector embedding search dataset

Introduction

The Hacker News dataset contains 28.74 million postings and their vector embeddings. The embeddings were generated using the SentenceTransformers model all-MiniLM-L6-v2. Each embedding vector has 384 dimensions. This dataset can be used to walk through the design, sizing, and performance aspects of a large-scale, real-world vector search application built on top of user-generated textual data.

Dataset details

The complete dataset with vector embeddings is made available by ClickHouse as a single Parquet file in an S3 bucket. We recommend that users first run a sizing exercise to estimate the storage and memory requirements for this dataset by referring to the documentation.

Steps

Create table

Create the hackernews table to store the postings…
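
As a rough starting point for the sizing exercise recommended under Dataset details, the raw size of the embedding vectors alone can be estimated from the row count and the vector dimension stated in the introduction. The sketch below is illustrative and not taken from the ClickHouse documentation: it assumes each vector element is stored as a 32-bit float (the excerpt does not state the element type), and it ignores the text columns, compression, and any vector index overhead, so the documentation remains the authority on real requirements.

```python
# Back-of-envelope size of the raw embedding vectors only.
NUM_ROWS = 28_740_000      # 28.74 million postings (from the dataset description)
DIMENSIONS = 384           # embedding dimension of all-MiniLM-L6-v2
BYTES_PER_ELEMENT = 4      # ASSUMPTION: each element is a 32-bit float


def raw_vector_bytes(rows: int, dims: int, bytes_per_element: int) -> int:
    """Uncompressed bytes needed to hold all vectors, excluding index overhead."""
    return rows * dims * bytes_per_element


total = raw_vector_bytes(NUM_ROWS, DIMENSIONS, BYTES_PER_ELEMENT)
print(f"{total:,} bytes")
print(f"{total / 10**9:.2f} GB")
print(f"{total / 2**30:.2f} GiB")
```

Under these assumptions the vectors occupy about 44.14 GB (41.11 GiB) uncompressed. Changing BYTES_PER_ELEMENT shows how a narrower element type would scale that figure.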
