Language Classifying 5TB of Web Content per Day

At Spinn3r we index a lot of HTML. On an average day we index about 5TB of HTML content and write about 600GB of that to our Elasticsearch index. As part of indexing, we perform data augmentation, including language detection.

Read the full details at http://ift.tt/2lfHFzi
