Reproducing SOTA work as a pathway into research and preparation for a bachelor thesis

by Rabiul Awal on July 1, 2020

Introduction

It is quite common for undergrad students to be undecided about which problem to choose for a bachelor thesis. Students from low-resourced institutions also find it very hard to even think about undergrad research, and they are unlikely to get into research because of a lack of proper guidance and motivation. Today I will talk about how motivated students can develop a pathway to academic research on their own. Research in Computer Science is hard, and one needs to become good enough to do research and publish it at a world-class venue. The truth is that doing scientific research is not enough; one is supposed to publish it through standard peer review. Ending up with a publishable piece of research is important for two reasons. First, your work is reviewed by others, which helps ensure its quality. Second, it is important for your career. You can use a bachelor thesis when applying for higher study, and it may help you get a job at a research-based institution or in industry. You can also use your training for further research and academic collaborations.

Motivation

Institutions like NSTU, which I consider low-resourced places, are not very supportive of scientific research at either the undergraduate or the graduate level. It’s very rare that grad students produce quality research and prepare themselves for future careers along this line. But it is possible to do research during undergrad and build a career using that training, experience, and certification.

Typically, research does not come from nowhere. Most academic institutions that produce a lot of work every year have already developed an environment for it. Those institutions are filled with faculty and research scientists who train undergrad and graduate students for academic research. Sadly, we don’t get that training.

So, to come up with a piece of good research work, one must follow systematic research methodologies. That raises the question of what those methodologies are and what training one needs for them. Every year thousands of papers are published at academic conferences in many fields of computer science. I follow the machine learning, AI, and NLP tracks myself, so I shall talk about them, but the same should apply to other fields too. Reproducing state-of-the-art (SOTA) research could be one possible pathway for interested minds to train themselves, mostly on their own. One major concern of research is novelty. By novelty, we mean that you have worked on a problem in a way that pushes existing work in that field forward: developing new observations, introducing new problems to solve, developing a new metric for judging a solution to a specific task, making a theoretical contribution that advances the field, devising a new algorithm or tool to solve a specific task, and so on.

Why is reproducing good for early researchers?

Reproducing does not count as strong novelty. It is just redoing research done by others and training yourself by working on an already solved problem. So, why does reproducing matter? It matters because you are teaching yourself using a piece of work that already meets the standard! As I said earlier, research requires certain standards to be met and demands certain procedures and a systematic approach. If you work on a completely new problem, you may miss some important parts, as no one is teaching or mentoring you! But if you redo something done by an expert, you are forced to match its quality. This can be compared with studying other people’s code in the early days of learning computer programming.

What does a research paper look like?

Let’s see what a typical research paper in machine learning, for example in natural language processing, contains:

  1. you come up with a problem or idea
  2. compile a proposal addressing your problem
  3. introduce the problem and motivate the idea
  4. review existing work by others on this specific problem, with a discussion of what has been done to date
  5. describe your contribution and the theoretical or practical hypothesis supporting it
  6. develop methods and design experiments
  7. define a metric for measuring the stated contribution
  8. carry out empirical evaluation and technical analysis
  9. conclude and discuss future work

The process of reproducing a piece of research

Let’s assume you picked one research paper of this kind. What will you learn by reproducing that piece of research?

  • You get to read ~30-50 papers
  • You will learn how to state a problem and explain why solving it is important for society and for the field itself. You will also learn how to write a formal introduction and how to motivate the problem.
  • For machine learning, there are certain models and algorithms that are commonly used. From the related works section, you will learn about all those tools and methods.
  • You should pick a paper for which the authors have already published the code and materials. If the work is based on very common algorithms, such as an LSTM or logistic regression, that should be okay for you! Try to learn from others and learn quality stuff; that’s the key.
  • By reading 10+ papers you will see a pattern in how authors explain their work and argue that their system or theory is better than others.
  • You will learn how to carry out an empirical analysis of the problem.
  • You will learn how to present an experiment and systematically support it. I think this is the most important part of your work. You will learn to use most of the tools used for this research, which will improve your ability to build methods for solving new research problems.
  • You will learn how to lay out a systematic analysis and claim novelty.
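
As one illustration of the empirical part of a reproduction, here is a minimal Python sketch; it is not taken from any particular paper. It assumes you have run the authors' published code a few times with different random seeds and noted the scores. The function name compare_to_reported, the tolerance value, and all the numbers are hypothetical and only show how you might check how close you got to the number reported in the paper.

```python
from statistics import mean, stdev


def compare_to_reported(reproduced_scores, reported_score, tolerance=1.0):
    """Summarize scores from several runs and compare them with the paper's number.

    All arguments are in the same unit (for example, accuracy in percent).
    The tolerance is an arbitrary choice made for this illustration.
    """
    avg = mean(reproduced_scores)
    # The sample standard deviation needs at least two runs.
    spread = stdev(reproduced_scores) if len(reproduced_scores) > 1 else 0.0
    gap = avg - reported_score
    return {
        "mean": avg,
        "std": spread,
        "gap": gap,
        "within_tolerance": abs(gap) <= tolerance,
    }


# Hypothetical accuracy (%) from three runs with different random seeds.
result = compare_to_reported([87.9, 88.4, 88.1], reported_score=88.6)
print(result)
```

Reporting the mean and spread over a few runs, together with the gap from the published number, is also useful material for the technical report you write afterwards.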

You must write a technical report on your own. It should be somewhat similar to the original paper, but try to write it in your own words. This is very important and a must-do!

By following this whole procedure you will learn the standards and methods for doing quality research. Reproducing research is purely a learning journey. You can definitely mention it as a project in your resume. But if you want to publish something, you need to offer something more. That is the search for novelty we talked about earlier.

Where to get a task to reproduce?

I will talk in light of machine learning research. Finding a good problem is pretty simple in machine learning. Go to paperswithcode.com and you will see some categories of tasks there. Look at the subsections. Does anything seem interesting? Pick some papers on that task and read them. Pick the most recent work and start working on that! Keep some big names in mind at this stage: Google, Facebook, DeepMind, OpenAI, Salesforce, CMU, Stanford, and all the other big names out there. Don’t be fooled by their work. It is often impossibly hard to reproduce, because you will lack the huge computation it needs and because the methods are fancy. You should pick something that may sound easy. You will find code and resources for most of the papers on paperswithcode.com. Don’t forget to check that code while deciding on a task. Remember that finishing is the ultimate goal here. You must take care to finish your work. If you can finish something, that will motivate you to go for something bigger. Don’t fail yourself, please. It’s possible for you if you have the passion. Just be strategic enough.

When to start reproducing SOTA?

In my opinion, it is wise to start at the end of the third year. It should not take more than six months to end up with a reproduced version of a SOTA work. So, you can use the last six months to add some novelty to your project.

How to introduce novelty in your work?

After reading lots of papers in a certain domain, including recent work, you will see that most papers end with a note on future work. Authors typically mention what could be done to advance the task and also talk about the current limitations of their method(s). If you follow these notes and work hard, you should be able to come up with some ideas to push the work further. That may require some support from a mentor, but you might be able to do it all alone. One way to find mentors when you are at a low-resourced institution is to collaborate with active researchers from other parts of the world. How about North America?

Where to find active researchers?

  • List the authors of the papers you are reading while reproducing. Go to Google Scholar and check their work. Try to come up with something concrete so that you can write an email to that person.
  • One nice fact is that ML researchers are very active on Twitter. If you have listed 50 researchers, you may find half of them on Twitter.
  • Pick someone who is young and hungry for research. Find someone who talks about diversity and inclusion. You are doing something from Bangladesh, so that matters in the sense of diversity and advancing underrepresented groups. Even companies like Google and Facebook have funds for helping underrepresented students!
  • Contact Bangladeshi Ph.D. students at North American schools. Tell them that you reproduced something, you wrote a paper, and you have an idea. Mostly they will be happy to help you.

Why will active researchers care to mentor or collaborate with you?

  1. You have already learned the methods of doing research
  2. You know the tools for solving a problem
  3. You have reproduced and you can tell a story based on your proposal
  4. You are trying to get into research, and some researchers out there care about your passion

This may feel like a long way. The truth is that research is a long way. You have to get through all this hassle and work hard to contribute something to a field. I find this important because, if you follow this path and do not like it, that should help you decide on your career. Most students are not aware of what academic research looks like and suffer a lot once they go abroad without having thought about these underlying concerns. So, undergraduate research could be a strong validity check on your interest!