# **About the Dataset**

This synthetic dataset is designed to mimic real-world scenarios in Bangladesh, with a particular focus on smoking behaviors. It draws on data, news, trends, and information published by government and non-government organizations. The features and values reflect practical, realistic trends inspired by research, demographics, and global sources such as the FDA and WHO. Here's how the realism is embedded into the dataset:

---

## **Realism of Dataset Features and Values**

### 1. **Age Distribution**
- **Generated Range**: 18 to 71 years.
- **Realistic Justification**:
  - In Bangladesh, smoking typically begins at around 14 years of age, peaks in adulthood, and declines slightly in later years.
  - By focusing on the 18–71 range, the dataset avoids edge cases (such as children) while realistically representing the majority of the smoking population.

---

### 2. **Gender Distribution**
- **Categories**: `"Male"` and `"Female"`.
- **Probabilities**: 60% Male, 40% Female.
- **Realistic Justification**:
  - According to data from Bangladesh and global trends, male smokers significantly outnumber female smokers.
  - Smoking is less prevalent among women because of cultural and social factors, and the skewed split keeps the dataset in line with ground realities.
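
The 60/40 split described above can be drawn with a weighted categorical sample. The following is a minimal sketch, assuming Python's standard `random` module; the function name `sample_genders` and the fixed seed are illustrative choices, not part of the original generator.

```python
import random

GENDERS = ["Male", "Female"]
GENDER_WEIGHTS = [0.60, 0.40]  # probabilities stated in this section


def sample_genders(n, seed=0):
    """Draw n gender labels using the stated 60/40 split.

    The seed only makes the sketch reproducible (an assumption).
    """
    rng = random.Random(seed)
    return rng.choices(GENDERS, weights=GENDER_WEIGHTS, k=n)


# Usage: the observed share of "Male" should sit close to 0.60.
sample = sample_genders(10_000)
print(sample.count("Male") / len(sample))
```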

---

### 3. **Professions**
- **List of Professions**:
  ```text
  "Student", "Agricultural Worker", "Teacher", "Healthcare Professional",
  "Textile Worker", "Shopkeeper/Trader", "Driver", "Construction Worker",
  "Housewife", "Freelancer/IT Professional", "Artisan/Craftsperson",
  "Unemployed/Dependent", "Others"
  ```
- **Realistic Justification**:
  - The professions selected reflect a wide spectrum of Bangladesh’s workforce. For instance:
    - "Agricultural Worker" is highly relevant, as a significant portion of Bangladesh's workforce is involved in agriculture.
    - "Textile Worker" is a key category due to Bangladesh's large garment industry.
    - Inclusion of "Housewife" accounts for non-working women exposed to secondhand smoke at home.
  - Probabilities assigned to each profession are based on occupational prevalence in Bangladesh.

---

### 4. **Geographical Data (Districts)**
- **Districts**: All 64 districts of Bangladesh are included.
- **Even Distribution**:
  - Every district has an equal probability of being selected.
  - This gives national coverage without biasing the data towards urban or rural areas, enabling balanced geographic analysis.
- **District IDs**:
  - Numerical IDs assigned to each district allow easier integration with mapping and visualization tools, such as choropleth maps.
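
Uniform district selection with an attached numeric ID might look like the sketch below. It assumes a short placeholder list of eight district names rather than all 64, and a hypothetical ID scheme (1-based position in the list). The actual IDs used in the dataset are not specified here.

```python
import random

# Placeholder subset for brevity; the dataset itself covers all 64 districts.
DISTRICTS = [
    "Dhaka", "Chattogram", "Khulna", "Rajshahi",
    "Sylhet", "Barishal", "Rangpur", "Mymensingh",
]

# Hypothetical ID scheme: 1-based position in the list above.
DISTRICT_IDS = {name: i for i, name in enumerate(DISTRICTS, start=1)}


def sample_districts(n, seed=0):
    """Pick n districts uniformly at random and attach their numeric IDs."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        name = rng.choice(DISTRICTS)  # every district is equally likely
        rows.append({"district": name, "district_id": DISTRICT_IDS[name]})
    return rows


# Usage: each row carries both the name and the ID for joining with map data.
print(sample_districts(3))
```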

---

### 5. **Smoking Status**
- **Categories**: `"Never Smoked"`, `"Former Smoker"`, `"Light Smoker"`, `"Medium Smoker"`, `"Heavy Smoker"`.
- **Probabilities**:
  - "Never Smoked" has the highest probability (50%), reflecting non-smokers as the majority.
  - "Light Smoker", "Medium Smoker", and "Heavy Smoker" are distributed to mirror real-world prevalence, where heavy smoking is less common.
- **Realistic Justification**:
  - These categories align with WHO's classification and reflect smoking habits in Bangladesh, where light to medium smoking is more prevalent than heavy smoking.
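
As a rough illustration of the status distribution, the sketch below samples the five categories with explicit weights. Only the 50% share for "Never Smoked" comes from this section. The remaining four weights are assumptions chosen so that heavy smoking is the rarest category, and they should be replaced with the real values.

```python
import random

# "Never Smoked" = 0.50 is stated above; the other weights are illustrative
# assumptions, picked only so that heavy smoking is the least common.
SMOKING_STATUS_WEIGHTS = {
    "Never Smoked": 0.50,
    "Former Smoker": 0.15,
    "Light Smoker": 0.15,
    "Medium Smoker": 0.12,
    "Heavy Smoker": 0.08,
}


def sample_smoking_status(n, seed=0):
    """Draw n smoking-status labels according to the weights above."""
    rng = random.Random(seed)
    labels = list(SMOKING_STATUS_WEIGHTS)
    weights = list(SMOKING_STATUS_WEIGHTS.values())
    return rng.choices(labels, weights=weights, k=n)


# Usage: roughly half of the sampled labels should be "Never Smoked".
statuses = sample_smoking_status(10_000)
print(statuses.count("Never Smoked") / len(statuses))
```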

---

### 6. **Age at Smoking Initiation**
- **Generated Values**:
  - For smokers, randomly assigned between 10 years and the individual’s current age.
- **Realistic Justification**:
  - Early initiation is a concern in Bangladesh, with many smokers starting in their teens. The dataset captures this trend while omitting cases of smoking starting before age 10, which are statistically negligible.
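
The initiation rule above (a random age between 10 and the person's current age, for smokers only) can be sketched as follows. Two points are assumptions rather than facts from this document: "Former Smoker" is treated as a smoker, and never-smokers get `None`.

```python
import random

# Assumption: former smokers also have an initiation age.
SMOKER_STATUSES = {"Former Smoker", "Light Smoker", "Medium Smoker", "Heavy Smoker"}
MIN_INITIATION_AGE = 10  # lower bound stated in this section


def initiation_age(age, smoking_status, rng):
    """Return an initiation age in [10, age], or None for never-smokers."""
    if smoking_status not in SMOKER_STATUSES:
        return None  # assumed encoding for people who never smoked
    return rng.randint(MIN_INITIATION_AGE, age)  # inclusive on both ends


# Usage: the result never exceeds the individual's current age.
rng = random.Random(0)
print(initiation_age(35, "Light Smoker", rng))
print(initiation_age(35, "Never Smoked", rng))
```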

---

### 7. **Exposure to Secondhand Smoke**
- **Categories**: `"Yes"`, `"No"`.
- **Probabilities**: 65% Yes, 35% No.
- **Realistic Justification**:
  - Bangladesh has a high prevalence of secondhand smoke exposure, especially in households and workplaces. The dataset reflects this trend.

---

### 8. **Awareness of Smoking Risks**
- **Categories**: `"High"`, `"Moderate"`, `"Low"`, `"No Awareness"`.
- **Probabilities**:
  - Higher probabilities for "Moderate" and "Low" awareness categories reflect the average public understanding in Bangladesh, where anti-smoking campaigns are increasing but not yet fully effective.

---

### 9. **Pocket Money**
- **Categories**: `"High"`, `"Medium"`, `"Low"`.
- **Realistic Justification**:
  - Stratified based on professions, reflecting income disparity in Bangladesh.
  - For example:
    - "Healthcare Professional" and "Freelancer/IT Professional" are likely to have high pocket money.
    - "Agricultural Worker" and "Teacher" fall into the medium range.
    - "Student" and "Artisan/Craftsperson" typically have low pocket money.
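
One simple way to express the profession-based stratification is a lookup table, sketched below. It covers only the six professions named above. Because this section says "likely" and "typically", the real generator may well assign levels probabilistically, so the deterministic mapping and the `default` fallback for unlisted professions are simplifying assumptions.

```python
# Levels for the six professions mentioned in this section. The deterministic
# mapping is a simplification; the other professions are left unspecified.
POCKET_MONEY_BY_PROFESSION = {
    "Healthcare Professional": "High",
    "Freelancer/IT Professional": "High",
    "Agricultural Worker": "Medium",
    "Teacher": "Medium",
    "Student": "Low",
    "Artisan/Craftsperson": "Low",
}


def pocket_money_level(profession, default=None):
    """Look up the pocket-money level; return `default` if none is listed."""
    return POCKET_MONEY_BY_PROFESSION.get(profession, default)


# Usage: listed professions resolve to a level, unlisted ones fall back.
print(pocket_money_level("Teacher"))
print(pocket_money_level("Driver", default="Unspecified"))
```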

---

### 10. **Health Symptoms**
- **Included Symptoms**:
  ```text
  "Cough", "Shortness of Breath", "Chest Pain", "Fatigue", "Persistent Cough", "Wheezing"
  ```
- **Linked Probabilities**:
  - Symptoms are correlated with smoking intensity, based on established medical research (e.g., WHO data on smoking-related diseases).
- **Realistic Justification**:
  - Heavy smokers have a higher likelihood of experiencing these symptoms, while non-smokers rarely report them. This alignment reflects real health outcomes.
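
The link between smoking intensity and symptoms could be modeled by giving each status its own per-symptom probability, as in the sketch below. The probabilities are hypothetical and chosen only so that they rise with intensity. Drawing each symptom independently is also an assumption, since the actual correlation structure is not described here.

```python
import random

SYMPTOMS = [
    "Cough", "Shortness of Breath", "Chest Pain",
    "Fatigue", "Persistent Cough", "Wheezing",
]

# Hypothetical per-symptom probabilities, increasing with smoking intensity.
SYMPTOM_PROBABILITY = {
    "Never Smoked": 0.05,
    "Former Smoker": 0.15,
    "Light Smoker": 0.25,
    "Medium Smoker": 0.45,
    "Heavy Smoker": 0.70,
}


def sample_symptoms(smoking_status, rng):
    """Include each symptom independently with the status-specific probability."""
    p = SYMPTOM_PROBABILITY[smoking_status]
    return [symptom for symptom in SYMPTOMS if rng.random() < p]


# Usage: heavy smokers tend to report more symptoms than never-smokers.
rng = random.Random(0)
print(sample_symptoms("Heavy Smoker", rng))
print(sample_symptoms("Never Smoked", rng))
```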

---

### **Synthetic Yet Realistic**
- Although this dataset is synthetic, it is designed to mimic real-world scenarios:
  - Probabilities are inspired by global data (FDA, WHO) and tailored to Bangladesh’s unique cultural, occupational, and demographic landscape.
  - This is intended to make the dataset representative of smoking trends in the country.

---

## **Please Note**
This synthetic dataset provides a realistic and practical foundation for analyzing smoking behaviors in Bangladesh. Aligning its features and values with ground realities supports its use in research, policymaking, and public health initiatives.