Skip to content

Databricks

Why

  • Calculate lots of stuff
  • Fast

What

  1. Managed Apache Spark
  2. Delta Lake
    1. all data in one place
    2. data can be used in multiple jobs
      1. at the same time
  3. ML Flow
    1. data scientist: how do I deploy this model
    2. model registry, deployment
    3. alternative: Kubeflow
  4. Code
    1. Jupyter notebooks
  5. Web UI
    1. execute jobs
    2. view run
    3. task orchestration

Apache Spark

Fault tolerant

Like Hadoop but faster????

Databricks helps you manage your Apache Spark faster

At Forma, we use Databricks to run commissions runs super quickly (order of magnitudes faster)

will feel at home with Jupyter notebooks

Data lakehouse

Data warehouse

  • expensive hardware
  • easily do analysis on your data
  • security

Data lake

  • massive amounts of data
  • cheap hardware

lakehouse

  1. lake: cheap hardware and store tons of data
  2. still being able to do analysis on data

Databricks vs Snowflake

????


Last update: 2023-04-24