
7 YouTube channels to learn machine learning

YouTube channels, including Sentdex and Data School, offer in-depth data science and machine learning explorations to enhance data-driven decision-making.

Machine learning is a fascinating and rapidly growing field revolutionizing various industries. If you’re interested in diving into the world of machine learning and developing your skills, YouTube can be an excellent platform to start your learning journey.

Numerous YouTube channels are dedicated to teaching machine learning concepts, algorithms and practical applications. This article will explore seven top YouTube channels that offer high-quality content to help you grasp the fundamentals and advance your machine-learning expertise.

3Blue1Brown

Grant Sanderson’s YouTube channel, 3Blue1Brown, has gained fame for its exceptional ability to elucidate intricate mathematical and machine learning concepts using captivating, intuitive animations.

Catering to a wide audience, the channel is widely recognized as a leading resource for mathematics, data science and machine learning topics. Its unique approach to presenting complex subjects has earned it a reputation as one of the finest educational channels in these fields.

Sentdex

Harrison Kinsley’s channel, Sentdex, provides a vast library of lessons and guidance on machine learning. The channel focuses on Python programming for machine learning, including subjects like data analysis, deep learning, gaming, finance and natural language processing.

Sentdex is an excellent resource for anyone trying to advance their machine learning knowledge using Python, with clear explanations and useful examples.

Corey Schafer

Although not exclusively devoted to machine learning, Corey Schafer’s YouTube channel includes several great videos on data science and Python programming. His machine learning lessons cover a range of topics, including model training, model evaluation and data pre-processing. Learners can better comprehend the fundamental ideas and practical features of machine learning algorithms thanks to Schafer’s in-depth lectures and coding demonstrations.

Related: How to learn Python with ChatGPT

Siraj Raval

The YouTube channel of Siraj Raval is well known for making difficult machine learning concepts understandable. His enthusiastic and upbeat teaching style makes learning fun and interesting. The channel offers a variety of content, such as walkthroughs of projects, tutorials and discussions on the most recent artificial intelligence (AI) research.

Raval’s channel is ideal for both beginning and seasoned learners wishing to advance their skills because it heavily emphasizes hands-on projects.

StatQuest with Josh Starmer

StatQuest is an exceptional channel for understanding the statistical concepts behind machine learning algorithms. Hosted by Josh Starmer, former assistant professor at the University of North Carolina at Chapel Hill, the channel uses visual explanations and analogies to simplify complex statistical ideas.

By gaining a solid understanding of statistics, viewers can better grasp the working principles of various machine learning models.

Related: 5 emerging trends in deep learning and artificial intelligence

Data School

Data School, run by Kevin Markham, focuses on data science and machine learning tutorials using Python and well-known tools like Scikit-Learn and Pandas. The channel provides extensive playlists that cover machine learning algorithms, data visualization and real data projects. Learners with little to no prior machine learning experience will benefit from Markham’s well-structured and beginner-friendly teaching style.

DeepLearningAI

DeepLearningAI was founded by Andrew Ng, a renowned AI researcher who established Google Brain. The platform has gained immense global popularity through his deep learning specialization on Coursera.

The DeepLearningAI channel provides a diverse range of educational content, including video lectures, tutorials, interviews with industry experts, and interactive live Q&A sessions. In addition to being an invaluable learning resource, DeepLearningAI keeps its viewers well-informed about the latest trends in machine learning and deep learning.


7 free learning resources to land top data science jobs

Discover seven free resources to learn data science and land top jobs.

Data science is an exciting and rapidly growing field that involves extracting insights and knowledge from data. To land a top data science job, it is important to have a solid foundation in key data science skills, including programming, statistics, data manipulation and machine learning.

Fortunately, there are many free online learning resources available that can help you develop these skills and prepare for a career in data science. These resources include online learning platforms such as Coursera, edX and DataCamp, which offer a wide range of courses in data science and related fields.

Coursera

Coursera is an online learning platform offering a variety of courses in data science and related fields. These courses frequently cover topics such as machine learning, data analysis and statistics, and are taught by academics from prestigious universities.

Here are some examples of data science courses on Coursera:

  • Applied Data Science with Python Specialization: This specialization, offered by the University of Michigan, consists of five courses that cover the basics of data manipulation, analysis and visualization using Python.
  • Machine Learning by Andrew Ng: This course, offered by Stanford University, provides an introduction to machine learning, including topics such as linear regression, logistic regression, neural networks and clustering.
  • Data Science Methodology: This course, offered by IBM, covers the basics of data science, including data preparation, data cleaning and data exploration.
  • Statistics with R Specialization: This specialization, offered by Duke University, consists of four courses that cover statistical inference, regression modeling and machine learning using the R programming language.

One can apply for financial aid to earn these certifications for free. However, completing a course just for the certificate may not be enough to land a dream job in data science.

Kaggle

Kaggle is a platform for data science competitions that provides a wealth of resources for learning and practicing data science skills. One can refine their skills in data analysis, machine learning and other branches of data science by participating in the platform’s challenges and exploring its host of data sets.

Here are some examples of free courses available on Kaggle:

  • Python: This course covers the basics of Python programming, including data types, control structures, functions and modules.
  • Pandas: This course covers the basics of data manipulation using Pandas, including data cleaning, data merging and data reshaping.
  • Data Visualization: This course covers the basics of data visualization using Matplotlib and Seaborn, including scatter plots, line plots and bar plots.
  • Intro to Machine Learning: This course covers the basics of machine learning, including classification, regression and clustering.
  • Intermediate Machine Learning: This course covers more advanced topics in machine learning, including feature engineering, model selection and hyperparameter tuning.
  • SQL: This course covers the basics of SQL, including data querying, data filtering and data aggregation.
  • Deep Learning: This course covers the basics of deep learning, including neural networks, convolutional neural networks and recurrent neural networks.

Related: 9 data science project ideas for beginners

edX

EdX is another online learning platform that offers courses in data science and related fields. Many of the courses on edX are taught by professors from top universities, and the platform offers both free and paid options for learning.

Some of the free courses on data science available on edX include:

  • Data Science Essentials: This course, offered by Microsoft, covers the basics of data science, including data exploration, data preparation and data visualization. It also covers key topics in machine learning, such as regression, classification and clustering.
  • Introduction to Python for Data Science: This course, offered by Microsoft, covers the basics of Python programming, including data types, control structures, functions and modules. It also covers key data science libraries in Python, such as Pandas, NumPy and Matplotlib.
  • Introduction to R for Data Science: This course, offered by Microsoft, covers the basics of R programming, including data types, control structures, functions and packages. It also covers key data science libraries in R, such as dplyr, ggplot2 and tidyr.

All of these courses are free to audit, meaning that you can access all the course materials and lectures without paying a fee. However, there is a cost if you wish to access further course features or receive a certificate of completion. In addition to these courses, a comprehensive selection of paid courses and programs in data science, machine learning and related topics is also available on edX.

DataCamp

DataCamp is an online learning platform that offers courses in data science, machine learning and other related fields. The platform offers interactive coding challenges and projects that can help you build real-world skills in data science.

The following courses are available for free on DataCamp:

  • Introduction to Python: This course covers the basics of Python programming, including data types, control structures, functions and modules.
  • Introduction to R: This course covers the basics of R programming, including data types, control structures, functions and packages.
  • Introduction to SQL: This course covers the basics of SQL, including data querying, data filtering and data aggregation.
  • Data Manipulation with Pandas: This course covers the basics of data manipulation using Pandas, including data cleaning, data merging and data reshaping.
  • Importing Data in Python: This course covers the basics of importing data into Python, including reading files, connecting to databases and working with web APIs.

All of these courses are free and can be accessed through DataCamp’s online learning platform. In addition to these courses, DataCamp also offers a wide range of paid courses and projects that cover topics such as data visualization, machine learning and data engineering.

Udacity

Udacity is an online learning platform that offers courses in data science, machine learning and other related fields. The platform offers both free and paid courses, and many of the courses are taught by industry professionals.

Here are some examples of free courses on data science available on Udacity:

  • Introduction to Python Programming: This course covers the basics of Python programming, including data types, control structures, functions and modules. It also covers key data science libraries in Python, such as NumPy and Pandas.
  • SQL for Data Analysis: This course covers the basics of SQL, including data querying, data filtering and data aggregation. It also covers more advanced topics in SQL, such as joins and subqueries.
  • Intro to Data Science: This course covers the basics of data science, including data wrangling, exploratory data analysis and statistical inference. It also covers key machine-learning techniques, such as regression, classification and clustering.

Related: 5 high-paying careers in data science

MIT OpenCourseWare

MIT OpenCourseWare is an online repository of course materials from courses taught at the Massachusetts Institute of Technology. The platform offers a variety of courses in data science and related fields, and all of the materials are available for free.

Here are some of the free courses on data science available on MIT OpenCourseWare:

  • Introduction to Computer Science and Programming in Python: This course covers the basics of Python programming, including data types, control structures, functions and modules. It also covers key data science libraries in Python, such as NumPy, Pandas and Matplotlib.
  • Introduction to Probability and Statistics: This course covers the basics of probability theory and statistical inference, including probability distributions, hypothesis testing and confidence intervals.
  • Machine Learning with Large Datasets: This course covers the basics of machine learning, including linear regression, logistic regression and k-means clustering. It also covers techniques for working with large data sets, such as map-reduce and Hadoop.

GitHub

GitHub is a platform for sharing and collaborating on code, and it can be a valuable resource for learning data science skills. However, GitHub itself does not offer free courses. Instead, one can explore the many open-source data science projects that are hosted on GitHub to find out more about how data science is used in practical situations.

Scikit-learn is a popular Python library for machine learning, which provides a range of algorithms for tasks such as classification, regression and clustering, along with tools for data preprocessing, model selection and evaluation. The project is open-source and available on GitHub.

Jupyter is an open-source web application for creating and sharing interactive notebooks. Jupyter notebooks provide a way to combine code, text and multimedia content in a single document, making it easy to explore and communicate data science results. 

These are just a few examples of the many open-source data science projects available on GitHub. By exploring these projects and contributing to them, one can gain valuable experience with data science tools and techniques, while also building their portfolio and demonstrating their skills to potential employers.


9 data science project ideas for beginners

Get started with nine beginner-friendly data science project ideas to enhance your skills and portfolio.

Beginners should undertake data science projects because they provide practical experience, help apply theoretical concepts learned in courses, build a portfolio and enhance skills. This allows them to gain confidence and stand out in the competitive job market.

If you’re considering a data science dissertation project or simply want to showcase proficiency in the field by conducting independent research and applying advanced data analysis techniques, the following project ideas may prove useful.

Sentiment analysis of product reviews

This involves analyzing a data set and creating visualizations to better understand the data. For instance, a project could examine user reviews of products on Amazon, using natural language processing (NLP) methods to ascertain the general sentiment toward those products. To accomplish this, a sizable collection of product reviews can be gathered from Amazon using web scraping methods or an Amazon product API.

Once the data has been gathered, it can be preprocessed by removing stop words, punctuation and other noise. A sentiment analysis algorithm can then be applied to the preprocessed text to determine each review’s polarity, that is, whether the sentiment it expresses is positive, negative or neutral. The results can be represented using graphs or other data visualization tools to convey the general opinion of the product.
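
As a rough illustration, here is a minimal sketch of such a pipeline using NLTK’s VADER analyzer; the sample reviews are made up, and a real project would load the scraped data set instead.

```python
# Minimal sentiment pipeline sketch: light preprocessing, then polarity scoring.
import nltk
from nltk.corpus import stopwords
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)
nltk.download("stopwords", quiet=True)

reviews = [  # placeholder reviews standing in for a scraped Amazon data set
    "Absolutely love this product, works perfectly!",
    "Terrible quality. Broke after two days.",
    "It's okay, nothing special.",
]

stop_words = set(stopwords.words("english"))
analyzer = SentimentIntensityAnalyzer()

for review in reviews:
    # Preprocess: lowercase and drop stop words (punctuation handling omitted).
    cleaned = " ".join(w for w in review.lower().split() if w not in stop_words)
    # VADER's compound score runs from -1 (most negative) to +1 (most positive).
    score = analyzer.polarity_scores(cleaned)["compound"]
    label = "positive" if score > 0.05 else "negative" if score < -0.05 else "neutral"
    print(f"{label:>8} {score:+.2f}  {review}")
```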

Predicting house prices

This project involves building a machine learning model to predict house prices based on various factors, such as location, square footage and the number of bedrooms.

For example, a model could use housing market data, including location, the number of bedrooms and bathrooms, square footage and previous sales data, to estimate the sale price of a particular house.

The model could be trained on a data set of past house sales and tested on a separate data set to evaluate its accuracy. The ultimate objective would be to offer insights and forecasts that might help real estate brokers, buyers and sellers make wise choices regarding pricing and buying or selling tactics.
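
A hedged sketch of that workflow, assuming a hypothetical house_sales.csv with sqft, bedrooms, bathrooms and price columns:

```python
# Train/test a regression model on past sales to predict sale prices.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

df = pd.read_csv("house_sales.csv")  # hypothetical past-sales data set
X = df[["sqft", "bedrooms", "bathrooms"]]  # hypothetical feature columns
y = df["price"]

# Hold out a separate test set to evaluate accuracy, as described above.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestRegressor(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

preds = model.predict(X_test)
print(f"Mean absolute error: ${mean_absolute_error(y_test, preds):,.0f}")
```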

Customer segmentation

A customer segmentation project involves using clustering algorithms to group customers based on their purchasing behavior, demographics and other factors.

A data science project related to customer segmentation could involve analyzing customer data from a retail company, such as transaction history, demographics and behavioral patterns. The goal would be to identify distinct customer segments using clustering techniques to group customers with similar characteristics together and identify the factors that differentiate each group.

This analysis could provide insights into customer behavior, preferences and needs, which could be used to develop targeted marketing campaigns, product recommendations and personalized customer experiences. By increasing customer satisfaction, loyalty and profitability, the retail company can benefit from the results of this project.
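
A minimal k-means sketch of this idea, assuming a hypothetical customers.csv summarizing each customer’s behavior:

```python
# Group customers into segments by purchasing behavior using k-means.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

customers = pd.read_csv("customers.csv")  # hypothetical per-customer summary
cols = ["total_spend", "visits_per_month", "avg_basket_size"]  # hypothetical features

# Scale features so no single large-valued column dominates the distances.
scaled = StandardScaler().fit_transform(customers[cols])

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
customers["segment"] = kmeans.fit_predict(scaled)

# Inspect each segment's average behavior to give it a business meaning.
print(customers.groupby("segment")[cols].mean())
```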

Fraud detection

This project involves building a machine learning model to detect fraudulent transactions in a data set. Using machine learning algorithms to examine financial transaction data and spot patterns of fraudulent activity is an example of a data science project related to fraud detection.

Related: How do crypto monitoring and blockchain analysis help avoid cryptocurrency fraud?

The ultimate objective is to create a reliable fraud detection model that can assist financial institutions in preventing fraudulent transactions and safeguarding the accounts of their consumers.
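
As a toy supervised sketch, assuming a hypothetical labeled transactions.csv; real fraud systems use far richer features, but the imbalanced-data considerations are the same:

```python
# Classify transactions as fraudulent or not; fraud is rare, so weight classes.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

tx = pd.read_csv("transactions.csv")  # hypothetical labeled transaction history
X = tx[["amount", "hour_of_day", "merchant_risk_score"]]  # hypothetical features
y = tx["is_fraud"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1
)

model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_train, y_train)

# Precision and recall matter far more than raw accuracy on imbalanced data.
print(classification_report(y_test, model.predict(X_test)))
```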

Image classification

This project involves building a deep learning model to classify images into different categories based on their visual features. The model could be trained on a large data set of labeled images and then tested on a separate data set to evaluate its accuracy.

The end goal would be to provide an automated image classification system that can be used in various applications, such as object recognition, medical imaging and self-driving cars.
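
A compact sketch with TensorFlow/Keras, using the built-in MNIST digits data set as a stand-in for a labeled image collection:

```python
# Train a small CNN to classify images, then evaluate on a held-out test set.
import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train[..., None] / 255.0  # add a channel dimension, scale to [0, 1]
x_test = x_test[..., None] / 255.0

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=1, batch_size=128, validation_split=0.1)

# Accuracy on the separate test set estimates real-world performance.
print(model.evaluate(x_test, y_test))
```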

Time series analysis

This project involves analyzing data over time and making predictions about future trends. A time series analysis project could involve analyzing historical price data for a specific cryptocurrency, such as Bitcoin (BTC), using statistical models and machine learning techniques to forecast future price trends.

The objective would be to offer insights and forecasts that can assist traders and investors in making wise choices about the purchase, sale and storage of cryptocurrencies.
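
A minimal forecasting sketch with statsmodels, assuming a hypothetical btc_daily.csv of dated closing prices; a real project would involve much more careful model selection:

```python
# Fit a simple ARIMA model on historical prices and check a 30-day forecast.
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

prices = pd.read_csv("btc_daily.csv", parse_dates=["date"], index_col="date")
series = prices["close"].asfreq("D").ffill()  # daily frequency, fill gaps

# Hold out the last 30 days to test the forecast against reality.
train, test = series[:-30], series[-30:]
model = ARIMA(train, order=(5, 1, 0)).fit()

forecast = model.forecast(steps=30)
mae = abs(forecast.values - test.values).mean()
print(f"Mean absolute 30-day forecast error: {mae:,.2f}")
```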

Recommendation system

This project involves building a recommendation system to suggest products or content to users based on their past behavior and preferences.

A recommendation system project could involve analyzing Netflix user data, such as viewing history, ratings and search queries, to make personalized movie and TV show recommendations. The goal is to provide users with a more personalized and relevant experience on the platform, which could increase engagement and retention.
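
An item-based collaborative filtering sketch on a tiny, made-up ratings matrix (the titles and users are placeholders):

```python
# Recommend unseen titles by their similarity to titles a user already rated.
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

ratings = pd.DataFrame(  # rows are users, columns are titles, NaN = not rated
    {"The Crown": [5, 4, None], "Narcos": [4, None, 5], "Dark": [None, 5, 4]},
    index=["user_a", "user_b", "user_c"],
)

# Two titles are similar when the same users rate them similarly.
filled = ratings.fillna(0)
item_sim = pd.DataFrame(
    cosine_similarity(filled.T), index=ratings.columns, columns=ratings.columns
)

# Score user_a's unseen titles by their mean similarity to titles already rated.
seen = ratings.loc["user_a"].dropna().index
unseen = ratings.columns.difference(seen)
print(item_sim.loc[unseen, seen].mean(axis=1).sort_values(ascending=False))
```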

Web scraping and data analysis

Web scraping is the automated collection of data from multiple websites using software like BeautifulSoup or Scrapy, while data analysis is the process of examining the acquired data using statistical methods and machine learning algorithms. The project could involve scraping data from a website and analyzing it using data science methods to gain insights and make predictions.

Related: 5 high-paying careers in data science

Furthermore, it can entail gathering information about customer behavior, market trends or other pertinent subjects with the intention of offering organizations or individuals insights and practical advice. The ultimate goal is to use the massive volumes of data that are readily accessible online to produce insightful discoveries and guide data-driven decision-making.
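
A minimal scraping-plus-analysis sketch with requests and BeautifulSoup; the URL and CSS selectors are placeholders, and a real project should respect the target site’s terms of service and robots.txt:

```python
# Scrape a hypothetical product listing page and load it into a DataFrame.
import pandas as pd
import requests
from bs4 import BeautifulSoup

resp = requests.get("https://example.com/products", timeout=10)  # placeholder URL
soup = BeautifulSoup(resp.text, "html.parser")

rows = []
for item in soup.select("div.product"):  # hypothetical page structure
    rows.append({
        "name": item.select_one("h2").get_text(strip=True),
        "price": item.select_one("span.price").get_text(strip=True),
    })

df = pd.DataFrame(rows)
print(df.describe(include="all"))  # a quick first look before deeper analysis
```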

Blockchain transaction analysis

A blockchain transaction analysis project involves analyzing data from a blockchain network, such as Bitcoin or Ethereum, to identify patterns, trends and insights about transactions on the network. This can help improve understanding of blockchain-based systems and potentially inform investment decisions or policymaking.

The key goal is to use the blockchain’s openness and immutability to obtain fresh knowledge about how network users behave and make it possible to build decentralized apps that are more durable and resilient.
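
A small sketch using web3.py to pull the latest Ethereum block’s transactions; the RPC endpoint is a placeholder for your own node or provider URL:

```python
# Summarize transfer sizes in the latest Ethereum block via a JSON-RPC node.
from web3 import Web3

w3 = Web3(Web3.HTTPProvider("https://YOUR-RPC-ENDPOINT"))  # placeholder endpoint

block = w3.eth.get_block("latest", full_transactions=True)
values = [w3.from_wei(tx["value"], "ether") for tx in block.transactions]

print(f"Block {block.number}: {len(values)} transactions")
if values:
    print(f"Largest transfer: {max(values):.4f} ETH")
```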


5 high-paying careers in data science

Data science careers tend to have high salaries — often over six figures — as the demand for skilled professionals in this field continues to grow.

Data science plays a critical role in supporting decision-making processes by providing insights and recommendations based on data analysis. In order to create new products, services and procedures, businesses can use data science to gain a deeper understanding of consumer behavior, market trends and corporate performance.

Through better decision-making, increased consumer engagement and more efficient corporate processes, data science gives companies a competitive edge in the market. The demand for data science experts is rising quickly, opening up new opportunities for both personal and professional development.

Here are five high-paying careers in data science.

Data scientist

A data scientist is a specialist who draws conclusions and knowledge from both structured and unstructured data using scientific methods, processes, algorithms and systems. They create models and algorithms to categorize data, make predictions and find hidden patterns. Additionally, they clearly and effectively communicate their findings and outcomes to all relevant parties.

Data scientists have solid backgrounds in statistics, mathematics and computer science, as well as a practical understanding of the Python and R programming languages and expertise in dealing with sizable data sets. The position calls for a blend of technical and analytical abilities, as well as the capacity to explain complicated results to non-technical audiences.

A data scientist in the United States can expect to earn $121,169 per year, according to Glassdoor. Additionally, advantages like stock options, bonuses and profit-sharing are frequently included in remuneration packages for data scientists. However, a data scientist’s pay might vary significantly depending on a number of variables, including geography, industry, years of experience and educational background.

Machine learning engineer

A machine learning engineer is responsible for designing, building and deploying scalable machine learning models for real-world applications. They create and use algorithms to decipher complex data, interpret it and make predictions. In order to incorporate these models into a finished product, they also work with software engineers.

Typically, a machine learning engineer has a solid foundation in programming, computer science and mathematics. In the U.S., the average income for a machine learning engineer is $136,150, while top earners in big cities or those with substantial expertise may make considerably more.

Big data engineer

Big data engineers create, build and maintain the architecture of a company’s big data infrastructure. They use a variety of big data technologies, including Hadoop, Spark and NoSQL databases, to design, build and manage the storage, processing and analysis of huge and complex data sets.

They also work with data scientists, data analysts and software engineers to develop and implement big data solutions that satisfy an organization’s business needs. In the U.S., a data engineer can expect to make an average annual salary of $114,501.

Business intelligence manager

An organization’s decision-making processes are supported by data-driven solutions, which are developed and implemented under the direction of a business intelligence (BI) manager. They coordinate the implementation of BI tools and systems, create and prioritize business intelligence initiatives, and work in close collaboration with data analysts, data scientists and IT teams.

The data used in these solutions must be of a high standard, and BI managers must convey the findings and insights to senior leaders and stakeholders in order to inform business strategy. They are essential in creating and maintaining data governance and security rules that safeguard confidential corporate data. Salaries for business intelligence managers in the U.S. typically range from $122,740 to $157,551, with average compensation of $140,988 per year.

Data analyst manager

A data analyst manager is responsible for leading a team of data analysts and overseeing the collection, analysis and interpretation of large and complex data sets. They develop and implement data analysis strategies, using various tools and technologies, to support decision-making processes and inform business strategy.

To make sure that data analysis initiatives are in line with company goals and objectives, data analyst managers closely collaborate with data scientists, business intelligence teams and senior management. They also play a crucial part in guaranteeing the accuracy and quality of the data used in analytic initiatives, as well as in conveying findings and suggestions to stakeholders. They could also be in charge of overseeing the allocation of resources and managing the budget for projects involving data analysis. In the U.S., a data analyst makes an average base salary of $66,859.


How we scaled data streaming at Coinbase using AWS MSK

By: Dan Moore, Eric Sun, LV Lu, Xinyu Liu

Tl;dr: Coinbase is leveraging AWS’ Managed Streaming for Kafka (MSK) for ultra low latency, seamless service-to-service communication, data ETLs, and database Change Data Capture (CDC). Engineers from our Data Platform team will further present this work at AWS’ November 2021 re:Invent conference.

Abstract

At Coinbase, we ingest billions of events daily from user, application, and crypto sources across our products. Clickstream data is collected via web and mobile clients and ingested into Kafka using a home-grown Ruby and Golang SDK. In addition, Change Data Capture (CDC) streams from a variety of databases are powered via Kafka Connect. One major consumer of these Kafka messages is our data ETL pipeline, which transmits data to our data warehouse (Snowflake) for further analysis by our Data Science and Data Analyst teams. Moreover, internal services across the company (like our Prime Brokerage and real time Inventory Drift products) rely on our Kafka cluster for running mission-critical, low-latency (sub 10 msec) applications.

With AWS-managed Kafka (MSK), our team has mitigated the day-to-day Kafka operational overhead of broker maintenance and recovery, allowing us to concentrate our engineering time on core business demands. We have found scaling up/out Kafka clusters and upgrading brokers to the latest Kafka version simple and safe with MSK. This post outlines our core architecture and the complete tooling ecosystem we’ve developed around MSK.

Configuration and Benefits of MSK

Config:

  • TLS authenticated cluster
  • 30 broker nodes across multiple AZs to protect against full AZ outage
  • Multi-cluster support
  • ~17TB storage/broker
  • 99.9% monthly uptime SLA from AWS

Benefits:

Since MSK is AWS managed, one of the biggest benefits is that we’re able to avoid having internal engineers actively maintain ZooKeeper / broker nodes. This has saved us 100+ hours of engineering work as AWS handles all broker security patch updates, node recovery, and Kafka version upgrades in a seamless manner. All broker updates are done in a rolling fashion (one broker node is updated at a time), so no user read/write operations are impacted.

Moreover, MSK offers flexible networking configurations. Our cluster has tight security group ingress rules around which services can communicate directly with ZooKeeper or MSK broker node ports. Integration with Terraform allows for seamless broker additions, disk space increases and configuration updates to our cluster without any downtime.

Finally, AWS has offered excellent MSK Enterprise support, meeting with us on several occasions to answer thorny networking and cluster auth questions.

Performance:

We reduced our end-to-end (e2e) latency (time taken to produce, store, and consume an event) by ~95% when switching from Kinesis (~200 msec e2e latency) to Kafka (<10 msec e2e latency). Our Kafka stack’s p50 e2e latency for payloads up to 100KB averages <10 msec (in line with benchmarks from LinkedIn, the company originally behind Kafka). This opens doors for ultra low latency applications like our Prime Brokerage service. A full latency breakdown from stress tests on our prod cluster, by payload size, is presented below.

Proprietary Kafka Security Service (KSS)

What is it?

Our Kafka Security Service (KSS) houses all topic Access Control Lists (ACLs). On deploy, it automatically syncs all topic read/write ACL changes with MSK’s ZooKeeper nodes; effectively, this is how we’re able to control read/write access to individual Kafka topics at the service level.

KSS also signs Certificate Signing Requests (CSRs) using the AWS ACM API. To do this, we leverage our internal Service-to-Service authentication (S2S) framework, which gives us a trustworthy service_id from the client. We then use that service_id and add it as the Distinguished Name in the signed certificate we return to the user.

With a signed certificate whose Distinguished Name matches one’s service_id, MSK can easily detect via TLS auth whether a given service should be allowed to read/write from a particular topic. If the service is not allowed (according to our acl.yml file and ACLs set in ZooKeeper) to perform a given action, an error will occur on the client side and no Kafka read/write operations will occur.

Also Required

Parallel to KSS, we built a custom Kafka sidecar Docker container that:

  • Plugs simply into one’s existing docker-compose file
  • Auto-generates CSRs on bootup and calls KSS to get signed certs
  • Stores credentials in a Docker shared volume on the user’s service, which can be used when instantiating a Kafka producer/consumer client so TLS auth can occur (as sketched below)
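
For illustration only, here is roughly what a client service might look like when it instantiates a producer with the sidecar-issued credentials; this sketch uses the confluent-kafka Python client rather than Coinbase’s internal SDK, and the broker addresses and certificate paths are hypothetical.

```python
# Instantiate a Kafka producer with mTLS credentials from the shared volume.
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "broker1:9094,broker2:9094",    # placeholder brokers
    "security.protocol": "SSL",
    "ssl.ca.location": "/shared/certs/ca.pem",           # hypothetical sidecar paths
    "ssl.certificate.location": "/shared/certs/client.pem",
    "ssl.key.location": "/shared/certs/client.key",
})

# The broker maps the cert's Distinguished Name to a service_id and checks it
# against the topic's ACLs before accepting the write.
producer.produce("example-topic", value=b"hello")
producer.flush()
```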

Rich Data Stream Tooling

We’ve extended our core Kafka cluster with the following powerful tools:

Kafka Connect

This is a distributed cluster of EC2 nodes (AWS autoscaling group) that performs Change Data Capture (CDC) on a variety of database systems. Currently, we’re leveraging the MongoDB, Snowflake, S3, and Postgres source/sink connectors. Many other connectors are available open-source through Confluent.

Kafdrop

We’re leveraging the open-source Kafdrop product for first-class topic/partition offset monitoring and for inspecting consumer lags.

Cruise Control

This is another open-source project, which provides automatic partition rebalancing to keep our cluster load and disk space even across all broker nodes.

Confluent Schema Registry

We use Confluent’s open-source Schema Registry to store versioned proto definitions (widely used alongside gRPC at Coinbase).

Internal Kafka SDK

Critical to our streaming stack is a custom Golang Kafka SDK developed internally, based on the segmentio/kafka release. The internal SDK is integrated with our Schema Registry so that proto definitions are automatically registered / updated on producer writes. Moreover, the SDK gives users the following benefits out of the box:

  • Consumers can automatically deserialize messages based on the magic byte and the matching Schema Registry record
  • Message provenance headers (such as service_id, event_time, event_type) help conduct end-to-end audits of event stream completeness and latency metrics
  • These headers also accelerate message filtering and routing by avoiding the penalty of deserializing the entire payload (see the sketch after this list)
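
As a hedged illustration of attaching such headers (in Python with confluent-kafka rather than the internal Golang SDK; the broker address and header values are made up):

```python
# Produce a message with provenance headers so consumers can filter/audit
# without deserializing the payload.
import time
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "broker1:9094"})  # placeholder broker

producer.produce(
    "example-topic",
    value=b"opaque-serialized-payload",  # payload left opaque on purpose
    headers={
        "service_id": "inventory-drift",             # hypothetical producer service
        "event_time": str(int(time.time() * 1000)),  # epoch millis
        "event_type": "inventory.updated",
    },
)
producer.flush()
```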

Streaming SDK

Beyond Kafka, we may still need to make use of other streaming solutions, including Kinesis, SNS, and SQS. We introduced a unified Streaming-SDK to address the following requirements:

  • Delivering a single event to multiple destinations, often described as ‘fanout’ or ‘mirroring’ (sketched after this list). For instance, sending the same message simultaneously to a Kafka topic and an SQS queue
  • Receiving messages from one Kafka topic, emitting new messages to another topic or even a Kinesis stream as the result of data processing
  • Supporting dynamic message routing, for example, messages can failover across multiple Kafka clusters or AWS regions
  • Offering optimized configurations for each streaming platform to minimize human mistakes, maximize throughput and performance, and alert users of misconfigurations
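
As a toy illustration of the fanout requirement (this is not the internal Streaming-SDK; the broker address and queue URL are placeholders):

```python
# Fan out one event to two destinations: a Kafka topic and an SQS queue.
import json

import boto3
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "broker1:9094"})  # placeholder broker
sqs = boto3.client("sqs", region_name="us-east-1")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/example-queue"  # placeholder

def fanout(event: dict) -> None:
    payload = json.dumps(event)
    producer.produce("example-topic", value=payload.encode())  # destination 1: Kafka
    producer.flush()
    sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=payload)  # destination 2: SQS

fanout({"event_type": "order.created", "order_id": 42})
```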

Upcoming

On the horizon is integration with our Delta Lake which will fuel more performant, timely data ETLs for our data analyst and data science teams. Beyond that, we have the capacity to 3x the number of broker nodes in our prod cluster (30 -> 90 nodes) as internal demand increases — that is a soft limit which can be increased via an AWS support ticket.

Takeaways

Overall, we’ve been quite pleased with AWS MSK. The automatic broker recovery during security patches, maintenance, and Kafka version upgrades, along with the advanced broker/topic level monitoring metrics around disk space usage and broker CPU, have saved us hundreds of hours provisioning and maintaining broker and ZooKeeper nodes on our own. Integration with Terraform has made initial cluster configuration, deployment, and configuration updates relatively painless (use 3 AZs for your cluster to make it more resilient and prevent impact from a full-AZ outage).

Performance has exceeded expectations, with sub-10 msec latencies opening doors for ultra high-speed applications. Uptime of the cluster has been sound, surpassing the 99.9% SLA given by AWS. Moreover, any security patch is always applied in a rolling broker fashion, so no read/write operations are impacted (set the default topic replication factor to 3 so that the minimum number of in-sync replicas is 2 even with a node failure).

We’ve found building on top of MSK highly extensible, having integrated Kafka Connect, Confluent Schema Registry, Kafdrop, Cruise Control, and more without issue. Ultimately, MSK has been beneficial both for our engineers maintaining the system (less overhead maintaining nodes) and for unlocking our internal users and services with the power of ultra-low latency data streaming.

If you’re excited about designing and building highly-scalable data platform systems or working with cutting-edge blockchain data sets (data science, data analytics, ML), come join us on our mission of building the world’s open financial system: check out our careers page.


