28M Hacker News comments as vector embedding search dataset
Introduction

The Hacker News dataset contains 28.74 million postings and their vector embeddings. The embeddings were generated using the SentenceTransformers model all-MiniLM-L6-v2. Each embedding vector has 384 dimensions. This dataset can be used to walk through the design, sizing, and performance aspects of a large-scale, real-world vector search application built on top of user-generated textual data.

Dataset details

The complete dataset with vector embeddings is made available by ClickHouse as a single Parquet file in an S3 bucket. We recommend that users first run a sizing exercise to estimate the storage and memory requirements for this dataset by referring to the documentation.

Steps

Create table

Create the hackernews table to store the postings…