Databricks¶
Why¶
- Calculate lots of stuff
- Fast
What¶
- Managed Apache Spark
- Delta Lake
- all data in one place
- data can be used in multiple jobs
- at the same time
- ML Flow
- data scientist: how do I deploy this model
- model registry, deployment
- alternative: Kubeflow
- Code
- Jupyter notebooks
- Web UI
- execute jobs
- view run
- task orchestration
Apache Spark¶
Fault tolerant
Like Hadoop but faster????
Databricks helps you manage your Apache Spark faster
At Forma, we use Databricks to run commissions runs super quickly (order of magnitudes faster)
will feel at home with Jupyter notebooks
Data lakehouse¶
Data warehouse¶
- expensive hardware
- easily do analysis on your data
- security
Data lake¶
- massive amounts of data
- cheap hardware
lakehouse
- lake: cheap hardware and store tons of data
- still being able to do analysis on data
Databricks vs Snowflake¶
????
Last update:
2023-04-24