Exploring Data Science Horizons: A Novice's Journey to Becoming a Data Expert
Introduction:
As a curious child, I was always fascinated by Maths and numbers. The spark ignited when I was first introduced to Calculus and Linear Algebra in high school. Upon entering college for my bachelor's degree, I explored different avenues until I stumbled across the field of 'Data Science and AI.' My curiosity led me to delve into the subject, and I was captivated by the realization that this field was a fusion of Maths, Statistics, and Programming. The concept of using numbers to teach machines how humans perceive the world intrigued me even more, prompting me to immerse myself in this field. Given my non-Computer Science background and the lack of organized resources on the internet, stepping into this domain presented its own set of challenges.🤔
Gradually, my journey into the realm of data science unfolded. I delved into Statistics, mastered the basics of Python programming, and explored various Machine Learning and Deep Learning models, all while actively seeking internships. A big shoutout goes to the seniors who generously shared their knowledge, helping me grasp the intricate concepts along the way.
This transformative journey reached a pivotal moment when I stumbled upon an opening for a Data Science internship at FutureSmart AI. Eagerly, I applied, and after a rigorous assessment, I was fortunate enough to secure the position. A year and a half of unwavering dedication culminated in this success, marking a significant milestone in my professional growth. 🚀
Now, as my remarkable journey with FutureSmart AI comes to an end, I find myself reflecting on the incredible expedition I've undertaken. Evolving from a novice in the realm of data science to a confident intern, I've amassed a wealth of knowledge and am eager to share the highlights of my discoveries.
Learning Experience:
At FutureSmart AI, my learning curve was very steep. While engaging in scalable real-life projects, I gained knowledge about Hugging Face Transformer models, state-of-the-art language models like GPT-4, vector databases such as ChromaDB, and LLM frameworks like LangChain and LlamaIndex, among other technologies. Throughout my internship, I had the opportunity to work on various projects, ranging from chatbot development for diverse clients to creating interactive applications that allow users to create and customize chatbots at the click of a button.
Additionally, I worked on applications that fetch insights from databases using natural-language queries, as well as on analyzing heart disease and building predictive models for disease classification in the medical domain. Despite the challenge of balancing my college responsibilities with my internship, I met every deadline. I also had the privilege of deploying these projects on cloud platforms like AWS after load-testing them appropriately, and I used Streamlit to build the front-ends for these applications.
The six months of the internship not only enriched my technical skills but also honed my time management, enabling me to keep up with college work while delivering results at FutureSmart AI.
Major Contributions:
My first project involved building a model to classify heart disease from patient reports, using fine-tuned transformer models like BERT alongside large language models (LLMs) such as GPT-4. During this time, I was introduced to super exciting libraries like LangChain, and I gained hands-on experience with few-shot prompting and GPT-3.5 fine-tuning. I was also introduced to deploying the project on a cloud-based web service, Amazon Web Services (AWS).
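To give a flavor of the few-shot prompting piece, here is a minimal sketch using the OpenAI chat completions API; the system prompt, labels, and example reports are illustrative stand-ins, not the actual project data.

```python
# Minimal few-shot classification sketch with the OpenAI API.
# Assumes OPENAI_API_KEY is set; labels and reports below are illustrative.
from openai import OpenAI

client = OpenAI()

FEW_SHOT = [
    {"role": "system", "content": "Classify the patient report as 'heart disease' or 'no heart disease'. Reply with the label only."},
    {"role": "user", "content": "Patient reports chest pain on exertion and shortness of breath."},
    {"role": "assistant", "content": "heart disease"},
    {"role": "user", "content": "Routine checkup; vitals normal, no cardiac complaints."},
    {"role": "assistant", "content": "no heart disease"},
]

def classify_report(report: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=FEW_SHOT + [{"role": "user", "content": report}],
        temperature=0,  # keep the classification deterministic
    )
    return response.choices[0].message.content.strip()

print(classify_report("ECG shows ST elevation; patient reports radiating arm pain."))
```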
In my next project, for an e-commerce company, I used OpenAI models to build an end-to-end chatbot integrated with external knowledge sources, namely ChromaDB and MySQL databases. The models were equipped with function calling to reach external APIs, and their answers were grounded in retrieved context through Retrieval-Augmented Generation (RAG). The project involved developing various FastAPI endpoints for an interactive application in which end-users could create new chatbots, view insights for a specific chatbot, access conversations between a chatbot and its end-users, supply external sources (files, text, or URLs) that were parsed and chunked with LangChain before being fed into ChromaDB, and delete any stored information. Since the application was deployed across more than 80 countries, I also took on the responsibility of load testing the chatbot, and went the extra mile by building Grafana dashboards for server monitoring and setting up PromQL-based alerts for API failures.
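To illustrate the RAG piece, here is a minimal LangChain + ChromaDB sketch; the file name, collection name, and question are hypothetical, the import paths follow the classic langchain package layout from around this time, and the real pipeline did considerably more parsing, chunking, and API orchestration.

```python
# Minimal RAG sketch: chunk a document, embed it into ChromaDB, answer with retrieval.
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA

# Split the client-provided source text into overlapping chunks before indexing.
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_text(open("knowledge_base.txt").read())  # hypothetical file

# Embed the chunks and store them in a local Chroma collection.
vectordb = Chroma.from_texts(chunks, OpenAIEmbeddings(), collection_name="demo_bot")

# Answer questions by stuffing the top-k retrieved chunks into the prompt.
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-3.5-turbo", temperature=0),
    retriever=vectordb.as_retriever(search_kwargs={"k": 3}),
)
print(qa.run("What is the return policy?"))
```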
In my third project, I again built an end-to-end chatbot, this time for an educational platform, integrating ChromaDB and a MySQL database. I also worked on a LlamaIndex-based chatbot capable of answering natural-language questions by fetching results from a SQL database. This project deepened my understanding of LlamaIndex and LangChain and of how to develop scalable solutions with them. Like my second project, it paired a Streamlit frontend with FastAPI endpoints on the backend.
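Here is a minimal sketch of that text-to-SQL pattern with LlamaIndex; the connection string, table, and question are hypothetical stand-ins for the client's schema, and the import path follows an older llama-index release.

```python
# Minimal LlamaIndex text-to-SQL sketch: question in, SQL generated and executed.
from sqlalchemy import create_engine
from llama_index import SQLDatabase
from llama_index.indices.struct_store.sql_query import NLSQLTableQueryEngine

# Placeholder MySQL credentials and schema.
engine = create_engine("mysql+pymysql://user:password@localhost/school")
sql_database = SQLDatabase(engine, include_tables=["courses"])

# The query engine writes SQL from the question, runs it, and summarizes the rows.
query_engine = NLSQLTableQueryEngine(sql_database=sql_database, tables=["courses"])
response = query_engine.query("Which courses have more than 100 enrolled students?")
print(response)
```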
I also worked with various APIs for scraping LinkedIn and used GPT-4 to generate customized messages for target users across different use cases. Additionally, I had the opportunity to explore the state-of-the-art open-source large language model (LLM) Llama 2 using LangChain and Hugging Face, which you can read more about here: Integrating Llama 2 with Hugging Face and Langchain🦙
Tools and Technologies Used:
Python:
Python has played a pivotal role in my professional endeavors, serving as the backbone for extensive data manipulation, analysis, and automation tasks. The rich ecosystem of Python libraries, including Pandas, NumPy, and scikit-learn, has been instrumental in streamlining the processing and handling of substantial datasets.
FastAPI:
FastAPI was used to develop the various backend APIs: it is a modern, high-performance Python web framework built specifically for APIs, with type-driven request validation and automatically generated interactive documentation.
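A minimal sketch of the kind of endpoint used across these projects; the route and payload shape are illustrative, not the production API.

```python
# Minimal FastAPI sketch with a typed request body and a JSON response.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class ChatRequest(BaseModel):
    chatbot_id: str
    message: str

@app.post("/chat")
def chat(request: ChatRequest) -> dict:
    # The production endpoint would invoke the LLM pipeline here; echoed for brevity.
    return {"chatbot_id": request.chatbot_id, "reply": f"You said: {request.message}"}

# Run locally with: uvicorn main:app --reload
```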
OpenAI API, LangChain, and LlamaIndex:
These libraries were used to build conversational AI systems capable of understanding and generating contextually relevant, human-like text. They also enabled me to generate syntactically correct SQL from natural-language questions, which was then executed against a database to address various use cases.
Hugging Face Transformers and ChromaDB:
Hugging Face Transformer models were used to generate embeddings for chunked texts, which were stored in ChromaDB. Querying this store for semantically similar texts made it possible to sift through large volumes of data and surface meaningful content efficiently.
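A minimal sketch of this embed-store-query loop, assuming a small sentence-transformers model; the chunks and collection name are illustrative.

```python
# Embed text chunks with a Hugging Face model, store and query them in ChromaDB.
import chromadb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # small, general-purpose embedder
chunks = [
    "Refunds are processed within 7 days.",
    "Shipping takes 3-5 business days.",
]

client = chromadb.Client()  # in-memory; the projects used a persistent store
collection = client.create_collection("docs")
collection.add(
    documents=chunks,
    embeddings=model.encode(chunks).tolist(),
    ids=[f"chunk-{i}" for i in range(len(chunks))],
)

# Retrieve the chunk most semantically similar to the query.
results = collection.query(
    query_embeddings=model.encode(["How long do refunds take?"]).tolist(),
    n_results=1,
)
print(results["documents"][0][0])  # -> "Refunds are processed within 7 days."
```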
MySQL:
MySQL databases were central to handling data efficiently. Across multiple projects during my internship, I worked on crafting optimized SQL to deliver robust, scalable, and high-performance database solutions.
Pandas, NumPy, and Scikit-learn:
These fundamental data science libraries were vital to my work, supporting data manipulation, numerical operations, and the application of machine learning algorithms for classification, regression, and clustering.
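A small, self-contained sketch of the kind of workflow these libraries enable, using a built-in toy dataset rather than any project data:

```python
# Train and evaluate a simple classifier with scikit-learn on a toy dataset.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# as_frame=True returns the features as a pandas DataFrame.
data = load_breast_cancer(as_frame=True)
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = LogisticRegression(max_iter=5000)  # raise the iteration cap so the solver converges
clf.fit(X_train, y_train)

print(f"Test accuracy: {accuracy_score(y_test, clf.predict(X_test)):.3f}")
```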
Streamlit:
Streamlit was used to develop interactive web applications, which were then deployed on AWS EC2 instances. Its built-in chat components make it a quick, intuitive way to put a frontend on a chatbot.
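A minimal sketch of a Streamlit chat interface, with an echoed reply standing in for the real model call:

```python
# Minimal Streamlit chat UI; run with: streamlit run app.py
import streamlit as st

st.title("Demo Chatbot")

# Streamlit reruns the script on every interaction, so keep history in session state.
if "messages" not in st.session_state:
    st.session_state.messages = []

for msg in st.session_state.messages:
    with st.chat_message(msg["role"]):
        st.write(msg["content"])

if prompt := st.chat_input("Ask me anything"):
    st.session_state.messages.append({"role": "user", "content": prompt})
    with st.chat_message("user"):
        st.write(prompt)
    reply = f"Echo: {prompt}"  # the real app would call the backend/LLM here
    st.session_state.messages.append({"role": "assistant", "content": reply})
    with st.chat_message("assistant"):
        st.write(reply)
```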
Postman:
Postman served as my go-to tool for testing and debugging API endpoints. Through its features, I could seamlessly send HTTP requests, examine responses, and validate the functionality embedded in the APIs I created.
Collaborative Tools:
During my internship, I leveraged various collaborative tools to promote effective teamwork and efficient project management. Git served as a reliable version control system; Jupyter Notebook and Google Colab enabled interactive data analysis and rapid prototyping; and diverse collaboration platforms facilitated seamless communication within the team.
AWS:
AWS EC2 played a key role in orchestrating end-to-end machine learning projects, providing the foundational infrastructure and necessary resources to ensure the deployment and functionality of the projects.
Additional Soft Skills:
During my internship, I not only honed my technical skills but also cultivated a myriad of invaluable soft skills that have undoubtedly enriched my professional repertoire.
Communication Skills: One of the foremost soft skills I acquired was effective communication. Through constant interaction with team members, superiors, and clients, I learned to articulate my ideas clearly and concisely.
Adaptability: Navigating the dynamic landscape of a real-world work environment necessitated a high level of adaptability. I quickly learned to embrace change, whether it be in project requirements, team structures, or technology stacks. This adaptability not only enhanced my problem-solving abilities but also instilled in me a sense of resilience in the face of unforeseen challenges.
Time Management: Balancing multiple tasks and deadlines taught me the importance of effective time management. Prioritizing assignments, meeting deadlines, and optimizing productivity became second nature.
Team Collaboration: Working collaboratively within a diverse team allowed me to appreciate the significance of teamwork. I enhanced my ability to collaborate with individuals possessing diverse skill sets and perspectives, fostering an environment conducive to innovation. This skill is instrumental in achieving collective goals and fostering a positive workplace culture.
Problem Solving: Real-world projects often present unforeseen challenges that require innovative solutions. Through my internship, I developed strong problem-solving skills by approaching issues with a systematic and analytical mindset. This skill is vital in troubleshooting technical problems and devising efficient solutions that contribute to project success.
Conclusion:
Finally, I am truly grateful for the guidance and mentorship provided by FutureSmart AI throughout all the projects. My experience with Large Language Models, combined with proficiency in Python and various libraries, and the utilization of cloud platforms like AWS, has equipped me with the capabilities to address real-world challenges and implement scalable solutions.
Having been part of a team of industry veterans at FutureSmart AI, I can vouch for the exceptional growth in my problem-solving and programming skills. To anyone who wishes to join the company in the future: you can expect phenomenal growth, both professionally and personally, as FutureSmart AI provides a holistic environment for development.🤗
Reflecting on my internship journey, I am thankful for the invaluable hands-on experiences and mentorship I have received. The skills and knowledge gained during this internship will be pivotal in laying the groundwork for my future ventures in the realm of data science. I look forward to advancing my professional development, actively contributing to innovative projects, and leveraging data-driven insights to make a positive impact.
Until then, stay curious and keep learning!🔥